eval_data <- read.csv("https://raw.githubusercontent.com/uzmabb182/Data_621/refs/heads/main/Assignment1/moneyball-evaluation-data%20(1).csv")
train_data <- read.csv("https://raw.githubusercontent.com/uzmabb182/Data_621/refs/heads/main/Assignment1/moneyball-training-data%20(1).csv")
head(eval_data)
head(train_data)
The TARGET_WINS field represents the number of games a team won in a given baseball season.
This is the dependent variable (target variable) in the multiple linear regression model.
We are trying to predict TARGET_WINS based on other team performance metrics.
names(eval_data)
## [1] "INDEX" "TEAM_BATTING_H" "TEAM_BATTING_2B" "TEAM_BATTING_3B"
## [5] "TEAM_BATTING_HR" "TEAM_BATTING_BB" "TEAM_BATTING_SO" "TEAM_BASERUN_SB"
## [9] "TEAM_BASERUN_CS" "TEAM_BATTING_HBP" "TEAM_PITCHING_H" "TEAM_PITCHING_HR"
## [13] "TEAM_PITCHING_BB" "TEAM_PITCHING_SO" "TEAM_FIELDING_E" "TEAM_FIELDING_DP"
names(train_data)
## [1] "INDEX" "TARGET_WINS" "TEAM_BATTING_H" "TEAM_BATTING_2B"
## [5] "TEAM_BATTING_3B" "TEAM_BATTING_HR" "TEAM_BATTING_BB" "TEAM_BATTING_SO"
## [9] "TEAM_BASERUN_SB" "TEAM_BASERUN_CS" "TEAM_BATTING_HBP" "TEAM_PITCHING_H"
## [13] "TEAM_PITCHING_HR" "TEAM_PITCHING_BB" "TEAM_PITCHING_SO" "TEAM_FIELDING_E"
## [17] "TEAM_FIELDING_DP"
We can see below that several fields contain missing (NA) values.
glimpse(eval_data)
## Rows: 259
## Columns: 16
## $ INDEX <int> 9, 10, 14, 47, 60, 63, 74, 83, 98, 120, 123, 135, 138…
## $ TEAM_BATTING_H <int> 1209, 1221, 1395, 1539, 1445, 1431, 1430, 1385, 1259,…
## $ TEAM_BATTING_2B <int> 170, 151, 183, 309, 203, 236, 219, 158, 177, 212, 243…
## $ TEAM_BATTING_3B <int> 33, 29, 29, 29, 68, 53, 55, 42, 78, 42, 40, 55, 57, 2…
## $ TEAM_BATTING_HR <int> 83, 88, 93, 159, 5, 10, 37, 33, 23, 58, 50, 164, 186,…
## $ TEAM_BATTING_BB <int> 447, 516, 509, 486, 95, 215, 568, 356, 466, 452, 495,…
## $ TEAM_BATTING_SO <int> 1080, 929, 816, 914, 416, 377, 527, 609, 689, 584, 64…
## $ TEAM_BASERUN_SB <int> 62, 54, 59, 148, NA, NA, 365, 185, 150, 52, 64, 48, 3…
## $ TEAM_BASERUN_CS <int> 50, 39, 47, 57, NA, NA, NA, NA, NA, NA, NA, 28, 21, 8…
## $ TEAM_BATTING_HBP <int> NA, NA, NA, 42, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ TEAM_PITCHING_H <int> 1209, 1221, 1395, 1539, 3902, 2793, 1544, 1626, 1342,…
## $ TEAM_PITCHING_HR <int> 83, 88, 93, 159, 14, 20, 40, 39, 25, 62, 53, 173, 196…
## $ TEAM_PITCHING_BB <int> 447, 516, 509, 486, 257, 420, 613, 418, 497, 482, 521…
## $ TEAM_PITCHING_SO <int> 1080, 929, 816, 914, 1123, 736, 569, 715, 734, 622, 6…
## $ TEAM_FIELDING_E <int> 140, 135, 156, 124, 616, 572, 490, 328, 226, 184, 200…
## $ TEAM_FIELDING_DP <int> 156, 164, 153, 154, 130, 105, NA, 104, 132, 145, 183,…
glimpse(train_data)
## Rows: 2,276
## Columns: 17
## $ INDEX <int> 1, 2, 3, 4, 5, 6, 7, 8, 11, 12, 13, 15, 16, 17, 18, 1…
## $ TARGET_WINS <int> 39, 70, 86, 70, 82, 75, 80, 85, 86, 76, 78, 68, 72, 7…
## $ TEAM_BATTING_H <int> 1445, 1339, 1377, 1387, 1297, 1279, 1244, 1273, 1391,…
## $ TEAM_BATTING_2B <int> 194, 219, 232, 209, 186, 200, 179, 171, 197, 213, 179…
## $ TEAM_BATTING_3B <int> 39, 22, 35, 38, 27, 36, 54, 37, 40, 18, 27, 31, 41, 2…
## $ TEAM_BATTING_HR <int> 13, 190, 137, 96, 102, 92, 122, 115, 114, 96, 82, 95,…
## $ TEAM_BATTING_BB <int> 143, 685, 602, 451, 472, 443, 525, 456, 447, 441, 374…
## $ TEAM_BATTING_SO <int> 842, 1075, 917, 922, 920, 973, 1062, 1027, 922, 827, …
## $ TEAM_BASERUN_SB <int> NA, 37, 46, 43, 49, 107, 80, 40, 69, 72, 60, 119, 221…
## $ TEAM_BASERUN_CS <int> NA, 28, 27, 30, 39, 59, 54, 36, 27, 34, 39, 79, 109, …
## $ TEAM_BATTING_HBP <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ TEAM_PITCHING_H <int> 9364, 1347, 1377, 1396, 1297, 1279, 1244, 1281, 1391,…
## $ TEAM_PITCHING_HR <int> 84, 191, 137, 97, 102, 92, 122, 116, 114, 96, 86, 95,…
## $ TEAM_PITCHING_BB <int> 927, 689, 602, 454, 472, 443, 525, 459, 447, 441, 391…
## $ TEAM_PITCHING_SO <int> 5456, 1082, 917, 928, 920, 973, 1062, 1033, 922, 827,…
## $ TEAM_FIELDING_E <int> 1011, 193, 175, 164, 138, 123, 136, 112, 127, 131, 11…
## $ TEAM_FIELDING_DP <int> NA, 155, 153, 156, 168, 149, 186, 136, 169, 159, 141,…
eval_df <- data.frame(eval_data)
train_df <- data.frame(train_data)
str(eval_df)
## 'data.frame': 259 obs. of 16 variables:
## $ INDEX : int 9 10 14 47 60 63 74 83 98 120 ...
## $ TEAM_BATTING_H : int 1209 1221 1395 1539 1445 1431 1430 1385 1259 1397 ...
## $ TEAM_BATTING_2B : int 170 151 183 309 203 236 219 158 177 212 ...
## $ TEAM_BATTING_3B : int 33 29 29 29 68 53 55 42 78 42 ...
## $ TEAM_BATTING_HR : int 83 88 93 159 5 10 37 33 23 58 ...
## $ TEAM_BATTING_BB : int 447 516 509 486 95 215 568 356 466 452 ...
## $ TEAM_BATTING_SO : int 1080 929 816 914 416 377 527 609 689 584 ...
## $ TEAM_BASERUN_SB : int 62 54 59 148 NA NA 365 185 150 52 ...
## $ TEAM_BASERUN_CS : int 50 39 47 57 NA NA NA NA NA NA ...
## $ TEAM_BATTING_HBP: int NA NA NA 42 NA NA NA NA NA NA ...
## $ TEAM_PITCHING_H : int 1209 1221 1395 1539 3902 2793 1544 1626 1342 1489 ...
## $ TEAM_PITCHING_HR: int 83 88 93 159 14 20 40 39 25 62 ...
## $ TEAM_PITCHING_BB: int 447 516 509 486 257 420 613 418 497 482 ...
## $ TEAM_PITCHING_SO: int 1080 929 816 914 1123 736 569 715 734 622 ...
## $ TEAM_FIELDING_E : int 140 135 156 124 616 572 490 328 226 184 ...
## $ TEAM_FIELDING_DP: int 156 164 153 154 130 105 NA 104 132 145 ...
str(train_df)
## 'data.frame': 2276 obs. of 17 variables:
## $ INDEX : int 1 2 3 4 5 6 7 8 11 12 ...
## $ TARGET_WINS : int 39 70 86 70 82 75 80 85 86 76 ...
## $ TEAM_BATTING_H : int 1445 1339 1377 1387 1297 1279 1244 1273 1391 1271 ...
## $ TEAM_BATTING_2B : int 194 219 232 209 186 200 179 171 197 213 ...
## $ TEAM_BATTING_3B : int 39 22 35 38 27 36 54 37 40 18 ...
## $ TEAM_BATTING_HR : int 13 190 137 96 102 92 122 115 114 96 ...
## $ TEAM_BATTING_BB : int 143 685 602 451 472 443 525 456 447 441 ...
## $ TEAM_BATTING_SO : int 842 1075 917 922 920 973 1062 1027 922 827 ...
## $ TEAM_BASERUN_SB : int NA 37 46 43 49 107 80 40 69 72 ...
## $ TEAM_BASERUN_CS : int NA 28 27 30 39 59 54 36 27 34 ...
## $ TEAM_BATTING_HBP: int NA NA NA NA NA NA NA NA NA NA ...
## $ TEAM_PITCHING_H : int 9364 1347 1377 1396 1297 1279 1244 1281 1391 1271 ...
## $ TEAM_PITCHING_HR: int 84 191 137 97 102 92 122 116 114 96 ...
## $ TEAM_PITCHING_BB: int 927 689 602 454 472 443 525 459 447 441 ...
## $ TEAM_PITCHING_SO: int 5456 1082 917 928 920 973 1062 1033 922 827 ...
## $ TEAM_FIELDING_E : int 1011 193 175 164 138 123 136 112 127 131 ...
## $ TEAM_FIELDING_DP: int NA 155 153 156 168 149 186 136 169 159 ...
train_df %>%
gather(variable, value, -TARGET_WINS) %>%
ggplot(., aes(value, TARGET_WINS)) +
geom_point(fill = "#628B3A", color="#628B3A") +
geom_smooth(method = "lm", se = FALSE, color = "black") +
facet_wrap(~variable, scales ="free", ncol = 4) +
labs(x = element_blank(), y = "Wins")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 3478 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 3478 rows containing missing values or values outside the scale range
## (`geom_point()`).
### Sorted Correlations
correlation_with_target <- cor(train_df, use = "complete.obs")["TARGET_WINS", ] %>%
sort(decreasing = TRUE) # Sort from highest to lowest correlation
print(correlation_with_target)
## TARGET_WINS TEAM_PITCHING_H TEAM_BATTING_H TEAM_BATTING_BB
## 1.00000000 0.47123431 0.46994665 0.46868793
## TEAM_PITCHING_BB TEAM_PITCHING_HR TEAM_BATTING_HR TEAM_BATTING_2B
## 0.46839882 0.42246683 0.42241683 0.31298400
## TEAM_BATTING_HBP TEAM_BASERUN_SB INDEX TEAM_BATTING_3B
## 0.07350424 0.01483639 -0.04895047 -0.12434586
## TEAM_BASERUN_CS TEAM_FIELDING_DP TEAM_BATTING_SO TEAM_PITCHING_SO
## -0.17875598 -0.19586601 -0.22889273 -0.22936481
## TEAM_FIELDING_E
## -0.38668800
library(ggplot2)
correlation_data <- data.frame(Variable = names(correlation_with_target), Correlation = correlation_with_target)
ggplot(correlation_data, aes(x = reorder(Variable, Correlation), y = Correlation, fill = Correlation > 0)) +
geom_bar(stat = "identity") +
coord_flip() + # Flip for better readability
labs(title = "Correlation with Target Wins", x = "Variables", y = "Correlation") +
scale_fill_manual(values = c("red", "blue")) +
theme_minimal()
### Checking Correlations and Recommended Actions

- Strong positive correlation (> 0.5): keep the variable in the regression.
- Strong negative correlation (< -0.5): keep (inverse relationship).
- Weak correlation (near 0): consider removing.
- Two highly correlated predictors (> 0.85): drop one to avoid multicollinearity.
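As a sketch (not part of the original analysis), the rules above can be applied programmatically; `keep_by_correlation` is a hypothetical helper, and the 0.5 / 0.85 cutoffs come straight from the guidelines:

```r
# Hypothetical helper applying the correlation rules above.
# Cutoffs (0.5 for weak/strong, 0.85 for collinear pairs) follow the text.
keep_by_correlation <- function(df, target, weak_cutoff = 0.5, collinear_cutoff = 0.85) {
  cors <- cor(df, use = "complete.obs")
  with_target <- cors[target, setdiff(colnames(df), target)]

  # Rules 1-3: keep only predictors whose |r| with the target clears the cutoff
  candidates <- names(with_target)[abs(with_target) >= weak_cutoff]

  # Rule 4: within the survivors, drop one member of any highly correlated pair,
  # keeping whichever of the two is more correlated with the target
  drop <- character(0)
  for (i in seq_along(candidates)) {
    for (j in seq_len(i - 1)) {
      a <- candidates[i]; b <- candidates[j]
      if (abs(cors[a, b]) > collinear_cutoff && !(a %in% drop) && !(b %in% drop)) {
        drop <- c(drop, if (abs(with_target[a]) < abs(with_target[b])) a else b)
      }
    }
  }
  setdiff(candidates, drop)
}
```

Note that with the thresholds as stated, none of the predictors in this dataset reach |r| > 0.5 with TARGET_WINS, so in practice the cutoffs would need loosening.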
skim(eval_df)
| Name | eval_df |
| Number of rows | 259 |
| Number of columns | 16 |
| _______________________ | |
| Column type frequency: | |
| numeric | 16 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| INDEX | 0 | 1.00 | 1263.93 | 693.29 | 9 | 708.0 | 1249.0 | 1832.50 | 2525 | ▆▆▆▇▅ |
| TEAM_BATTING_H | 0 | 1.00 | 1469.39 | 150.66 | 819 | 1387.0 | 1455.0 | 1548.00 | 2170 | ▁▂▇▁▁ |
| TEAM_BATTING_2B | 0 | 1.00 | 241.32 | 49.52 | 44 | 210.0 | 239.0 | 278.50 | 376 | ▁▂▇▇▂ |
| TEAM_BATTING_3B | 0 | 1.00 | 55.91 | 27.14 | 14 | 35.0 | 52.0 | 72.00 | 155 | ▇▇▃▁▁ |
| TEAM_BATTING_HR | 0 | 1.00 | 95.63 | 56.33 | 0 | 44.5 | 101.0 | 135.50 | 242 | ▆▅▇▃▁ |
| TEAM_BATTING_BB | 0 | 1.00 | 498.96 | 120.59 | 15 | 436.5 | 509.0 | 565.50 | 792 | ▁▁▅▇▁ |
| TEAM_BATTING_SO | 18 | 0.93 | 709.34 | 243.11 | 0 | 545.0 | 686.0 | 912.00 | 1268 | ▁▃▇▇▂ |
| TEAM_BASERUN_SB | 13 | 0.95 | 123.70 | 93.39 | 0 | 59.0 | 92.0 | 151.75 | 580 | ▇▃▁▁▁ |
| TEAM_BASERUN_CS | 87 | 0.66 | 52.32 | 23.10 | 0 | 38.0 | 49.5 | 63.00 | 154 | ▂▇▃▁▁ |
| TEAM_BATTING_HBP | 240 | 0.07 | 62.37 | 12.71 | 42 | 53.5 | 62.0 | 67.50 | 96 | ▃▇▅▁▁ |
| TEAM_PITCHING_H | 0 | 1.00 | 1813.46 | 1662.91 | 1155 | 1426.5 | 1515.0 | 1681.00 | 22768 | ▇▁▁▁▁ |
| TEAM_PITCHING_HR | 0 | 1.00 | 102.15 | 57.65 | 0 | 52.0 | 104.0 | 142.50 | 336 | ▇▇▆▁▁ |
| TEAM_PITCHING_BB | 0 | 1.00 | 552.42 | 172.95 | 136 | 471.0 | 526.0 | 606.50 | 2008 | ▆▇▁▁▁ |
| TEAM_PITCHING_SO | 18 | 0.93 | 799.67 | 634.31 | 0 | 613.0 | 745.0 | 938.00 | 9963 | ▇▁▁▁▁ |
| TEAM_FIELDING_E | 0 | 1.00 | 249.75 | 230.90 | 73 | 131.0 | 163.0 | 252.00 | 1568 | ▇▁▁▁▁ |
| TEAM_FIELDING_DP | 31 | 0.88 | 146.06 | 25.88 | 69 | 131.0 | 148.0 | 164.00 | 204 | ▁▂▇▇▂ |
skim(train_df)
| Name | train_df |
| Number of rows | 2276 |
| Number of columns | 17 |
| _______________________ | |
| Column type frequency: | |
| numeric | 17 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| INDEX | 0 | 1.00 | 1268.46 | 736.35 | 1 | 630.75 | 1270.5 | 1915.50 | 2535 | ▇▇▇▇▇ |
| TARGET_WINS | 0 | 1.00 | 80.79 | 15.75 | 0 | 71.00 | 82.0 | 92.00 | 146 | ▁▁▇▅▁ |
| TEAM_BATTING_H | 0 | 1.00 | 1469.27 | 144.59 | 891 | 1383.00 | 1454.0 | 1537.25 | 2554 | ▁▇▂▁▁ |
| TEAM_BATTING_2B | 0 | 1.00 | 241.25 | 46.80 | 69 | 208.00 | 238.0 | 273.00 | 458 | ▁▆▇▂▁ |
| TEAM_BATTING_3B | 0 | 1.00 | 55.25 | 27.94 | 0 | 34.00 | 47.0 | 72.00 | 223 | ▇▇▂▁▁ |
| TEAM_BATTING_HR | 0 | 1.00 | 99.61 | 60.55 | 0 | 42.00 | 102.0 | 147.00 | 264 | ▇▆▇▅▁ |
| TEAM_BATTING_BB | 0 | 1.00 | 501.56 | 122.67 | 0 | 451.00 | 512.0 | 580.00 | 878 | ▁▁▇▇▁ |
| TEAM_BATTING_SO | 102 | 0.96 | 735.61 | 248.53 | 0 | 548.00 | 750.0 | 930.00 | 1399 | ▁▆▇▇▁ |
| TEAM_BASERUN_SB | 131 | 0.94 | 124.76 | 87.79 | 0 | 66.00 | 101.0 | 156.00 | 697 | ▇▃▁▁▁ |
| TEAM_BASERUN_CS | 772 | 0.66 | 52.80 | 22.96 | 0 | 38.00 | 49.0 | 62.00 | 201 | ▃▇▁▁▁ |
| TEAM_BATTING_HBP | 2085 | 0.08 | 59.36 | 12.97 | 29 | 50.50 | 58.0 | 67.00 | 95 | ▂▇▇▅▁ |
| TEAM_PITCHING_H | 0 | 1.00 | 1779.21 | 1406.84 | 1137 | 1419.00 | 1518.0 | 1682.50 | 30132 | ▇▁▁▁▁ |
| TEAM_PITCHING_HR | 0 | 1.00 | 105.70 | 61.30 | 0 | 50.00 | 107.0 | 150.00 | 343 | ▇▇▆▁▁ |
| TEAM_PITCHING_BB | 0 | 1.00 | 553.01 | 166.36 | 0 | 476.00 | 536.5 | 611.00 | 3645 | ▇▁▁▁▁ |
| TEAM_PITCHING_SO | 102 | 0.96 | 817.73 | 553.09 | 0 | 615.00 | 813.5 | 968.00 | 19278 | ▇▁▁▁▁ |
| TEAM_FIELDING_E | 0 | 1.00 | 246.48 | 227.77 | 65 | 127.00 | 159.0 | 249.25 | 1898 | ▇▁▁▁▁ |
| TEAM_FIELDING_DP | 286 | 0.87 | 146.39 | 26.23 | 52 | 131.00 | 149.0 | 164.00 | 228 | ▁▂▇▆▁ |
summary(train_df)
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## Min. : 1.0 Min. : 0.00 Min. : 891 Min. : 69.0
## 1st Qu.: 630.8 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0
## Median :1270.5 Median : 82.00 Median :1454 Median :238.0
## Mean :1268.5 Mean : 80.79 Mean :1469 Mean :241.2
## 3rd Qu.:1915.5 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0
## Max. :2535.0 Max. :146.00 Max. :2554 Max. :458.0
##
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 34.00 1st Qu.: 42.00 1st Qu.:451.0 1st Qu.: 548.0
## Median : 47.00 Median :102.00 Median :512.0 Median : 750.0
## Mean : 55.25 Mean : 99.61 Mean :501.6 Mean : 735.6
## 3rd Qu.: 72.00 3rd Qu.:147.00 3rd Qu.:580.0 3rd Qu.: 930.0
## Max. :223.00 Max. :264.00 Max. :878.0 Max. :1399.0
## NA's :102
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H
## Min. : 0.0 Min. : 0.0 Min. :29.00 Min. : 1137
## 1st Qu.: 66.0 1st Qu.: 38.0 1st Qu.:50.50 1st Qu.: 1419
## Median :101.0 Median : 49.0 Median :58.00 Median : 1518
## Mean :124.8 Mean : 52.8 Mean :59.36 Mean : 1779
## 3rd Qu.:156.0 3rd Qu.: 62.0 3rd Qu.:67.00 3rd Qu.: 1682
## Max. :697.0 Max. :201.0 Max. :95.00 Max. :30132
## NA's :131 NA's :772 NA's :2085
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E
## Min. : 0.0 Min. : 0.0 Min. : 0.0 Min. : 65.0
## 1st Qu.: 50.0 1st Qu.: 476.0 1st Qu.: 615.0 1st Qu.: 127.0
## Median :107.0 Median : 536.5 Median : 813.5 Median : 159.0
## Mean :105.7 Mean : 553.0 Mean : 817.7 Mean : 246.5
## 3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 968.0 3rd Qu.: 249.2
## Max. :343.0 Max. :3645.0 Max. :19278.0 Max. :1898.0
## NA's :102
## TEAM_FIELDING_DP
## Min. : 52.0
## 1st Qu.:131.0
## Median :149.0
## Mean :146.4
## 3rd Qu.:164.0
## Max. :228.0
## NA's :286
summary(eval_df)
## INDEX TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## Min. : 9 Min. : 819 Min. : 44.0 Min. : 14.00
## 1st Qu.: 708 1st Qu.:1387 1st Qu.:210.0 1st Qu.: 35.00
## Median :1249 Median :1455 Median :239.0 Median : 52.00
## Mean :1264 Mean :1469 Mean :241.3 Mean : 55.91
## 3rd Qu.:1832 3rd Qu.:1548 3rd Qu.:278.5 3rd Qu.: 72.00
## Max. :2525 Max. :2170 Max. :376.0 Max. :155.00
##
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## Min. : 0.00 Min. : 15.0 Min. : 0.0 Min. : 0.0
## 1st Qu.: 44.50 1st Qu.:436.5 1st Qu.: 545.0 1st Qu.: 59.0
## Median :101.00 Median :509.0 Median : 686.0 Median : 92.0
## Mean : 95.63 Mean :499.0 Mean : 709.3 Mean :123.7
## 3rd Qu.:135.50 3rd Qu.:565.5 3rd Qu.: 912.0 3rd Qu.:151.8
## Max. :242.00 Max. :792.0 Max. :1268.0 Max. :580.0
## NA's :18 NA's :13
## TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
## Min. : 0.00 Min. :42.00 Min. : 1155 Min. : 0.0
## 1st Qu.: 38.00 1st Qu.:53.50 1st Qu.: 1426 1st Qu.: 52.0
## Median : 49.50 Median :62.00 Median : 1515 Median :104.0
## Mean : 52.32 Mean :62.37 Mean : 1813 Mean :102.1
## 3rd Qu.: 63.00 3rd Qu.:67.50 3rd Qu.: 1681 3rd Qu.:142.5
## Max. :154.00 Max. :96.00 Max. :22768 Max. :336.0
## NA's :87 NA's :240
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## Min. : 136.0 Min. : 0.0 Min. : 73.0 Min. : 69.0
## 1st Qu.: 471.0 1st Qu.: 613.0 1st Qu.: 131.0 1st Qu.:131.0
## Median : 526.0 Median : 745.0 Median : 163.0 Median :148.0
## Mean : 552.4 Mean : 799.7 Mean : 249.7 Mean :146.1
## 3rd Qu.: 606.5 3rd Qu.: 938.0 3rd Qu.: 252.0 3rd Qu.:164.0
## Max. :2008.0 Max. :9963.0 Max. :1568.0 Max. :204.0
## NA's :18 NA's :31
train_df %>%
  summarise(across(where(is.numeric),
                   list(mean   = \(x) mean(x, na.rm = TRUE),
                        median = \(x) median(x, na.rm = TRUE),
                        sd     = \(x) sd(x, na.rm = TRUE))))
The summarise(across()) call above computes the mean, median, and standard deviation for every numeric variable.
Before fitting a multiple linear regression (MLR) model, we analyze the dataset for potential issues such as missing values, extreme outliers, multicollinearity, and variable distributions.
Here’s what we can infer from the summary statistics:
- Range: 0 to 146 wins
- Mean: ~80.79 wins
- Median: 82 wins
- Distribution: the minimum of 0 and maximum of 146 suggest potential outliers or erroneous data points, since most teams win roughly 50-110 games in a season.
Some variables have a significant number of missing values. In particular: TEAM_BATTING_HBP (2,085 missing), TEAM_BASERUN_CS (772), TEAM_FIELDING_DP (286), TEAM_BASERUN_SB (131), and TEAM_BATTING_SO / TEAM_PITCHING_SO (102 each).

Additionally, four variables report zero values that appear suspicious:

- TEAM_BATTING_SO and TEAM_PITCHING_SO have zeros in the same rows, suggesting that data may not have been available for these entries.
- TEAM_BATTING_HR and TEAM_PITCHING_HR likewise have zeros in the same rows, again suggesting missing data recorded as zero.
Actionable Steps:

- Impute missing values (using mean/median or regression techniques).
- Consider removing TEAM_BATTING_HBP and TEAM_BASERUN_CS if they are highly incomplete and do not contribute much.
Several variables have extreme maximum values that seem unrealistic. Specifically:

- TEAM_PITCHING_H (max = 30,132): likely an error, since typical values range from 1,200 to 1,700.
- TEAM_PITCHING_SO (max = 19,278): suspiciously high (typical range: 500 to 1,500).
- TEAM_PITCHING_BB (max = 3,645): very high (typical range: 300 to 700).
- TEAM_FIELDING_E (max = 1,898): likely an error, since the normal range is roughly 70 to 200.
Actionable Steps:

- Check for data entry errors.
- Remove extreme outliers if they distort model performance.
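Rather than deleting rows outright, one option is to cap (winsorize) extreme values. This is a sketch: `cap_at_quantile` is a hypothetical helper and the 99th-percentile cutoff is an assumption, not a rule from the text.

```r
# Cap values above a chosen percentile instead of dropping the rows.
# The 0.99 cutoff is illustrative; tune it per variable.
cap_at_quantile <- function(x, p = 0.99) {
  cap <- quantile(x, probs = p, na.rm = TRUE, names = FALSE)
  pmin(x, cap)
}

# e.g. train_df$TEAM_PITCHING_H <- cap_at_quantile(train_df$TEAM_PITCHING_H)
```

Capping preserves the sample size, which matters here given how many rows already carry missing values.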
- Batting variables: TEAM_BATTING_H, TEAM_BATTING_2B, TEAM_BATTING_3B, TEAM_BATTING_HR, and TEAM_BATTING_BB are likely strong predictors of team wins. Note that TEAM_BATTING_H already includes the TEAM_BATTING_2B, TEAM_BATTING_3B, and TEAM_BATTING_HR counts.
- Pitching variables: TEAM_PITCHING_H, TEAM_PITCHING_HR, TEAM_PITCHING_BB, and TEAM_PITCHING_SO capture defensive strength.
- Fielding variables: TEAM_FIELDING_E (errors) and TEAM_FIELDING_DP (double plays) may have a weaker impact than batting and pitching.
Potential multicollinearity:

- TEAM_PITCHING_H, TEAM_PITCHING_HR, and TEAM_PITCHING_BB may be highly correlated with one another, which can cause multicollinearity in the regression model.
- TEAM_BATTING_H, TEAM_BATTING_2B, and TEAM_BATTING_3B may also be strongly correlated, since total hits include doubles and triples.

Actionable Steps:

- Handle missing values: consider dropping or imputing variables with too many missing values (e.g., TEAM_BATTING_HBP), and impute TEAM_BASERUN_SB, TEAM_BASERUN_CS, and TEAM_FIELDING_DP appropriately.
- Remove highly unrealistic values in the pitching, fielding, and error variables.
- Use the Variance Inflation Factor (VIF) to detect multicollinearity and drop redundant features.
- Consider derived metrics like batting average (H/AB), on-base percentage (OBP), or earned run average (ERA) instead of raw counts (note that at-bats and earned runs are not in this dataset, so these would require additional data).
The dataset contains inconsistencies, missing values, and extreme outliers that need to be addressed before fitting an MLR model.
Once cleaned, feature selection and multicollinearity checks will be essential to ensure a robust and interpretable model for predicting team wins.
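The VIF check mentioned above is usually done with car::vif() on a fitted model; as a dependency-free sketch, the same quantity can be computed by hand (`manual_vif` is a hypothetical helper):

```r
# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j
# on all remaining predictors. Values above roughly 5-10 signal multicollinearity.
manual_vif <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    r2 <- summary(lm(reformulate(others, response = v), data = df))$r.squared
    1 / (1 - r2)
  })
}
```

Running this on the predictor columns (after imputation) would flag pairs like TEAM_BATTING_HR / TEAM_PITCHING_HR for follow-up.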
boxplot(train_df$TARGET_WINS, main="Distribution of Team Wins", ylab="Wins", col="lightblue")
This box plot provides valuable insights into the distribution of team wins in the training dataset.
Here’s what we can infer:
- The thick horizontal line inside the box marks the median (~82 wins).
- The box (interquartile range, IQR) covers the middle 50% of the data, roughly 70 to 92 wins.
- Low-end outliers (~0-40 wins): several points fall below the lower whisker.
- High-end outliers (~110-146 wins): some points fall above the upper whisker, though visually fewer than at the low end.
- The low-win teams might be problematic for modeling because they could represent incomplete or missing data; potential data entry issues (e.g., a team with 0 wins) should be checked.
- If extreme values skew the regression, we might need transformations (e.g., log scaling).
- The box is fairly centered, suggesting a roughly symmetric distribution.
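To illustrate the log-scaling idea, here is a sketch with made-up values echoing the ranges reported earlier (not actual rows from the dataset):

```r
# log1p() (log(1 + x)) compresses a right-skewed variable while staying
# defined at zero counts, unlike plain log().
x <- c(1200, 1500, 1700, 30132)   # illustrative values only
log_x <- log1p(x)

range(x)      # the raw values span a ~25x ratio
range(log_x)  # the transformed values span a ~1.5x ratio
```

The extreme point is pulled far closer to the bulk of the data, which limits its leverage on a fitted regression line.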
hist(train_df$TARGET_WINS,
main = "Distribution of Team Wins",
xlab = "Wins",
ylab = "Frequency",
col = "steelblue",
border = "black",
breaks = 20) # number of bins
### Histogram with Density Curve
hist(train_df$TARGET_WINS,
main = "Histogram of Team Wins with Density Curve",
xlab = "Wins",
col = "lightgray",
border = "black",
breaks = 20,
probability = TRUE) # Converts y-axis to density
lines(density(train_df$TARGET_WINS, na.rm = TRUE),
col = "red",
lwd = 2) # Adds a red density curve
## Data Preparation
#install.packages("naniar") # Specialized for missing data visualization
#install.packages("visdat") # Another missing value visualization package
library(naniar)
##
## Attaching package: 'naniar'
## The following object is masked from 'package:skimr':
##
## n_complete
library(visdat)
library(dplyr)
library(tidyr) # tidyr is loaded to use pivot_longer()
missing_values <- train_df %>%
summarise(across(everything(), ~ sum(is.na(.)))) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Count")
print(missing_values)
## # A tibble: 17 × 2
## Variable Missing_Count
## <chr> <int>
## 1 INDEX 0
## 2 TARGET_WINS 0
## 3 TEAM_BATTING_H 0
## 4 TEAM_BATTING_2B 0
## 5 TEAM_BATTING_3B 0
## 6 TEAM_BATTING_HR 0
## 7 TEAM_BATTING_BB 0
## 8 TEAM_BATTING_SO 102
## 9 TEAM_BASERUN_SB 131
## 10 TEAM_BASERUN_CS 772
## 11 TEAM_BATTING_HBP 2085
## 12 TEAM_PITCHING_H 0
## 13 TEAM_PITCHING_HR 0
## 14 TEAM_PITCHING_BB 0
## 15 TEAM_PITCHING_SO 102
## 16 TEAM_FIELDING_E 0
## 17 TEAM_FIELDING_DP 286
ggplot(missing_values, aes(y = reorder(Variable, Missing_Count), x = Missing_Count, fill = Missing_Count > 0)) +
geom_col() +
labs(title = "Missing Values in Training Dataset",
x = "Number of Missing Values",
y = "Variables") +
scale_fill_manual(values = c("gray", "red"), labels = c("No Missing", "Has Missing")) +
theme_minimal()
### Strategy for Missing Values, Column by Column

There are four main options for handling missing values: dropping heavily missing columns, dropping rows with a missing target, flagging missingness as a new feature, and imputing.

If a column has too many missing values (e.g., >50% missing), it may be better to remove it. The TEAM_BATTING_HBP column is mostly empty and not critical, so we drop it (along with INDEX, which is only a row identifier).
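This rule can also be applied programmatically; `drop_mostly_missing` is a hypothetical helper, and the 50% cutoff mirrors the rule of thumb above:

```r
# Drop any column whose share of missing values exceeds a threshold.
drop_mostly_missing <- function(df, max_missing = 0.5) {
  df[, colMeans(is.na(df)) <= max_missing, drop = FALSE]
}

# e.g. train_df <- drop_mostly_missing(train_df)
# In this training set only TEAM_BATTING_HBP (~92% missing) exceeds the cutoff.
```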
library(naniar)
library(visdat)
library(dplyr)
library(tidyr) # tidyr is loaded to use pivot_longer()
train_df <- train_df[, !names(train_df) %in% "INDEX"]
train_df <- train_df[, !names(train_df) %in% "TEAM_BATTING_HBP"]
train_df
If the TARGET_WINS column has missing values, remove those rows since we can’t predict missing outcomes.
train_df <- train_df %>% filter(!is.na(TARGET_WINS))
train_df
If a column has many missing values but we do not want to remove it, we can create a new feature that flags which rows were missing:
train_df <- train_df %>%
mutate(TEAM_BASERUN_SB_Missing = ifelse(is.na(TEAM_BASERUN_SB), 1, 0))
train_df$TEAM_BASERUN_SB_Missing
## [1] 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [38] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
## [75] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [112] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [149] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [186] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [223] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [260] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
## [297] 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [334] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [371] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 1 1 1 1 1 0 1
## [408] 1 1 1 1 1 1 1 1 0 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [445] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [482] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [519] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [556] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [593] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [630] 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [667] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [704] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [741] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [778] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [815] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [852] 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0
## [889] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [926] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [963] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1
## [1000] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [1037] 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [1074] 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [1111] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [1148] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [1185] 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [1222] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [1259] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [1296] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [1333] 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [1370] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0
## [1407] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [1444] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [1481] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [1518] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [1555] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1
## [1592] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [1629] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [1666] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1
## [1703] 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [1740] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [1777] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
## [1814] 1 1 1 1 1 1 1 1 0 1 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [1851] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [1888] 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [1925] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [1962] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [1999] 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [2036] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [2073] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [2110] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0
## [2147] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [2184] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [2221] 1 1 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [2258] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
sum(is.na(train_df)) # Total missing values
## [1] 1393
colSums(is.na(train_df)) # Missing values per column
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## 0 0 0
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB
## 0 0 0
## TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
## 102 131 772
## TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB
## 0 0 0
## TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 102 0 286
## TEAM_BASERUN_SB_Missing
## 0
Missing values interfere with regression models.

By default, lm() in R quietly drops any row containing an NA in the response or a predictor (na.action = na.omit), which shrinks the sample; with na.action = na.fail it raises an error instead. Handling missing values explicitly ensures the model is fit on the data we intend.
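With the default na.action = na.omit, lm() silently drops incomplete rows rather than stopping; a tiny demonstration on toy data (not from the assignment):

```r
# One of the four rows has an NA predictor; lm() fits on the other three.
toy <- data.frame(y = c(1, 2, 3, 4), x = c(1, NA, 3, 4))
fit <- lm(y ~ x, data = toy)
length(residuals(fit))  # 3: the NA row was dropped, not errored on
```

This is exactly why explicit cleaning matters: otherwise rows vanish from the fit without any obvious signal.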
# Remove missing values
train_df1 <- na.omit(train_df)
train_df1
sum(is.na(train_df1)) # Total missing values
## [1] 0
colSums(is.na(train_df1)) # Missing values per column
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## 0 0 0
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB
## 0 0 0
## TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
## 0 0 0
## TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB
## 0 0 0
## TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 0 0 0
## TEAM_BASERUN_SB_Missing
## 0
train_df1 %>%
gather(variable, value, -TARGET_WINS) %>%
ggplot(., aes(value, TARGET_WINS)) +
geom_point(fill = "#628B3A", color="#628B3A") +
geom_smooth(method = "lm", se = FALSE, color = "black") +
facet_wrap(~variable, scales ="free", ncol = 4) +
labs(x = element_blank(), y = "Wins")
## `geom_smooth()` using formula = 'y ~ x'
#plot_qq(train_df1, sampled_rows = 1000L)
If too many rows have missing values, R removes them automatically, reducing sample size.
If missing values are not randomly distributed, removing them may bias the dataset.
Instead of na.omit(), imputation methods (like mean/median filling) may be better for handling missing data.
Removing rows with missing data is sometimes not the best approach, especially if a large portion of data is lost. Here are alternative methods:
Instead of removing missing values, fill them with the mean or median:
train_df$TEAM_BATTING_SO[is.na(train_df$TEAM_BATTING_SO)] <- mean(train_df$TEAM_BATTING_SO, na.rm = TRUE)
train_df$TEAM_BASERUN_SB[is.na(train_df$TEAM_BASERUN_SB)] <- mean(train_df$TEAM_BASERUN_SB, na.rm = TRUE)
train_df$TEAM_BASERUN_CS[is.na(train_df$TEAM_BASERUN_CS)] <- mean(train_df$TEAM_BASERUN_CS, na.rm = TRUE)
train_df$TEAM_PITCHING_SO[is.na(train_df$TEAM_PITCHING_SO)] <- mean(train_df$TEAM_PITCHING_SO, na.rm = TRUE)
train_df$TEAM_FIELDING_DP[is.na(train_df$TEAM_FIELDING_DP)] <- mean(train_df$TEAM_FIELDING_DP, na.rm = TRUE)
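The text mentions median filling as an alternative; here is a sketch (`impute_median` is a hypothetical helper). The median is more robust to the extreme values flagged earlier than the mean.

```r
# Fill NAs with the column median instead of the mean.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# e.g. train_df$TEAM_PITCHING_SO <- impute_median(train_df$TEAM_PITCHING_SO)
```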
library(DataExplorer)  # provides plot_qq()
plot_qq(train_df, sampled_rows = 1000L)
train_df_clean <- train_df
sum(is.na(train_df)) # Total missing values
## [1] 0
colSums(is.na(train_df)) # Missing values per column
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## 0 0 0
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB
## 0 0 0
## TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
## 0 0 0
## TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB
## 0 0 0
## TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 0 0 0
## TEAM_BASERUN_SB_Missing
## 0
base_model <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR + TEAM_PITCHING_SO + TEAM_FIELDING_E, data = train_df)
# View model summary
summary(base_model)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR +
## TEAM_PITCHING_SO + TEAM_FIELDING_E, data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.488 -9.350 0.078 9.258 50.323
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.7153321 3.2275839 2.700 0.00698 **
## TEAM_BATTING_H 0.0521590 0.0022114 23.587 < 2e-16 ***
## TEAM_BATTING_HR -0.0018746 0.0062230 -0.301 0.76326
## TEAM_PITCHING_SO 0.0010454 0.0005718 1.828 0.06764 .
## TEAM_FIELDING_E -0.0212115 0.0016889 -12.560 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.78 on 2271 degrees of freedom
## Multiple R-squared: 0.2363, Adjusted R-squared: 0.2349
## F-statistic: 175.6 on 4 and 2271 DF, p-value: < 2.2e-16
TEAM_BATTING_H (Hits): More hits increase the chances of scoring.
TEAM_BATTING_HR (Home Runs): Home runs are a major contributor to runs.
TEAM_PITCHING_SO (Strikeouts): More strikeouts reduce opponent scoring.
TEAM_FIELDING_E (Errors): More errors lead to more opponent runs (negative predictor).
Why exclude some variables?
TEAM_BASERUN_SB (Stolen Bases): Limited impact on overall wins.
TEAM_PITCHING_BB (Walks Allowed): May not be as predictive when combined with strikeouts.
1- Intercept (8.715)
When all predictor variables (TEAM_BATTING_H, TEAM_BATTING_HR, TEAM_PITCHING_SO, TEAM_FIELDING_E) are zero, a team is expected to have 8.72 wins; since no real team records zero hits, this is an extrapolation rather than a realistic scenario. Significant (p = 0.007) → the intercept is statistically distinguishable from zero.
2- TEAM_BATTING_H (Hits) → +0.052 (p < 2e-16 ***)
positive & highly significant Interpretation: For every additional hit, the team is expected to win 0.052 more games. Example: If a team gets 100 more hits in a season, they would be expected to win ~5 more games (100 × 0.052). Conclusion: More hits lead to more wins, which is expected in baseball.
3- TEAM_BATTING_HR (Home Runs) → -0.00187 (p = 0.76, not significant). Negative (unexpected) and not significant.
Interpretation: More home runs appear to slightly decrease wins, but the effect is very small and not statistically significant. Example: Hitting 100 more home runs would decrease wins by only 0.187, which is counterintuitive.
Possible Issues: Multicollinearity: Home runs may be highly correlated with other batting stats (like hits or doubles), causing misleading coefficients. Outliers/Bad Teams: Some losing teams hit a lot of home runs but still lost, skewing results.
Solution: Check VIF (Variance Inflation Factor) for multicollinearity. Add an interaction term (e.g., TEAM_BATTING_HR * TEAM_BATTING_BB). Consider removing this variable if it remains insignificant.
4- TEAM_PITCHING_SO (Strikeouts by Pitchers) → +0.0010 (p = 0.068 .) Weakly positive, borderline significant
Interpretation: More strikeouts slightly increase wins, but the effect is very small and only marginally significant (p = 0.067). Example: If a team strikes out 500 more batters in a season, they would win 0.5 more games.
Conclusion: Strikeouts help teams win, but they are not the strongest predictor of wins. The effect might be hidden by other defensive factors (e.g., walks, home runs allowed).
Solution: Consider adding walks (TEAM_PITCHING_BB) or earned run average (ERA) to capture pitching effectiveness better.
5- TEAM_FIELDING_E (Errors) → -0.0212 (p < 2e-16 ***) Negative & highly significant
Interpretation: Each additional error is associated with winning 0.021 fewer games. Example: A team with 50 more errors in a season would win ~1 fewer game (50 × 0.021). Conclusion: More errors directly hurt a team’s chances of winning, which makes sense in baseball.
Residual Std. Error (13.78): the average prediction error is ~13.78 wins.
Adjusted R² (0.2349): the model explains 23.5% of the variance in TARGET_WINS (not very strong).
F-Statistic (175.6, p < 2.2e-16): the overall model is statistically significant.
p < 2.2e-16 → The probability of getting this result by random chance is essentially 0 (very small).
At least one of the predictor variables in the model significantly affects TARGET_WINS.
If the p-value is very small (< 0.05), we reject the null hypothesis that “none of the independent variables explain wins.”
Our model as a whole is meaningful and explains a significant amount of variation in team wins.
At least one of our predictors (TEAM_BATTING_H, TEAM_BATTING_HR, TEAM_PITCHING_SO, TEAM_FIELDING_E) is statistically significant in predicting wins.
The F-statistic of 175.6 (p < 2.2e-16) means our model is highly statistically significant. This confirms that at least one of our variables—such as home runs, hits, strikeouts, or errors—has a real impact on predicting wins.
However, we still need to check which specific variables are the most meaningful (p-values of individual coefficients) and whether we can improve the model further.
Multicollinearity occurs when predictor variables are highly correlated, leading to unstable coefficients and inflated standard errors.
We use the Variance Inflation Factor (VIF) test. A VIF > 5 suggests multicollinearity, and VIF > 10 is a strong sign of redundancy.
library(car)
vif_model <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR + TEAM_PITCHING_SO + TEAM_FIELDING_E, data = train_df)
vif(vif_model)
## TEAM_BATTING_H TEAM_BATTING_HR TEAM_PITCHING_SO TEAM_FIELDING_E
## 1.225189 1.701317 1.144897 1.773312
Since all VIF values are below 5, there is no significant multicollinearity in the model. This means:
Each predictor contributes unique information to explaining TARGET_WINS. Regression coefficients are stable, and we do not need to remove any variables due to multicollinearity. The model is not distorted by highly correlated predictors.
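To make the VIF numbers above concrete: for predictor j, VIF_j = 1 / (1 − R_j²), where R_j² comes from regressing predictor j on the remaining predictors. A self-contained sketch on synthetic data (not the moneyball data) showing the computation `vif()` performs:

```r
# Synthetic predictors: x2 is deliberately built to correlate with x1
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.5)
x3 <- rnorm(n)

# Regress x1 on the other predictors, then apply the VIF formula
r2_x1  <- summary(lm(x1 ~ x2 + x3))$r.squared
vif_x1 <- 1 / (1 - r2_x1)
vif_x1  # well above 1, since x2 carries much of x1's information
```

In our model, every such ratio stays below 5, which is why the coefficients are considered stable.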
If a team hits more home runs and draws more walks, they likely score more runs. We test if walks amplify the impact of home runs on wins.
interaction_model <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR + TEAM_BATTING_BB +
TEAM_BATTING_HR:TEAM_BATTING_BB +
TEAM_PITCHING_SO + TEAM_FIELDING_E,
data = train_df)
summary(interaction_model)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR +
## TEAM_BATTING_BB + TEAM_BATTING_HR:TEAM_BATTING_BB + TEAM_PITCHING_SO +
## TEAM_FIELDING_E, data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.921 -8.709 0.033 9.237 52.279
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.517e+01 4.099e+00 3.702 0.000219 ***
## TEAM_BATTING_H 5.186e-02 2.186e-03 23.724 < 2e-16 ***
## TEAM_BATTING_HR -1.871e-01 2.724e-02 -6.866 8.49e-12 ***
## TEAM_BATTING_BB -8.253e-03 4.894e-03 -1.686 0.091872 .
## TEAM_PITCHING_SO 1.039e-03 5.650e-04 1.839 0.066076 .
## TEAM_FIELDING_E -2.354e-02 2.211e-03 -10.644 < 2e-16 ***
## TEAM_BATTING_HR:TEAM_BATTING_BB 3.188e-04 4.735e-05 6.733 2.10e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.57 on 2269 degrees of freedom
## Multiple R-squared: 0.2603, Adjusted R-squared: 0.2584
## F-statistic: 133.1 on 6 and 2269 DF, p-value: < 2.2e-16
The model now includes an interaction term (TEAM_BATTING_HR:TEAM_BATTING_BB) to test whether walks (BB) change the impact of home runs (HR) on wins. Let’s break down the results.
The residual standard error is 13.57 wins, slightly better than before (13.78). The model explains 26.0% of the variance in wins (up from 23.6% in the original model), with a similar gain in adjusted R², meaning the additional predictors added value. The model as a whole is statistically significant (at least one predictor explains wins).
More Home Runs (HR) Alone → Fewer Wins (Unexpected): The negative coefficient (-0.187) on TEAM_BATTING_HR suggests that hitting more home runs alone does not necessarily lead to more wins.
Walks (BB) Alone Have a Weak Impact on Wins: The coefficient for TEAM_BATTING_BB is negative (-0.0083) and only marginally significant (p = 0.092). This means that walks alone do not have a strong impact on wins.
The Interaction Term (TEAM_BATTING_HR * TEAM_BATTING_BB) is Highly Significant (p = 2.10e-11) with a Positive Coefficient (+0.000319)
Teams that hit home runs AND get on base with walks tend to win more games. This confirms that home runs are more valuable when combined with walks.
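A quick calculation using the coefficients reported in the summary above shows when an extra home run starts to pay off. The break-even walk total is our own derived quantity, not part of the original output:

```r
# Marginal effect of one extra home run, as a function of team walks:
#   d(WINS)/d(HR) = beta_HR + beta_interaction * BB
beta_hr  <- -1.871e-01   # TEAM_BATTING_HR coefficient
beta_int <-  3.188e-04   # TEAM_BATTING_HR:TEAM_BATTING_BB coefficient

hr_effect <- function(bb) beta_hr + beta_int * bb

hr_effect(300)            # still negative for a low-walk team
hr_effect(700)            # positive for a high-walk team
break_even_bb <- -beta_hr / beta_int
round(break_even_bb)      # walks needed before an extra HR adds wins
```

This is the arithmetic behind "home runs are more valuable when combined with walks": below roughly 587 walks, the fitted model assigns extra home runs a negative marginal effect.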
We want to see how home runs (TEAM_BATTING_HR) and walks (TEAM_BATTING_BB) impact wins (TARGET_WINS) together.
library(ggplot2)
ggplot(train_df, aes(x = TEAM_BATTING_HR, y = TARGET_WINS, color = TEAM_BATTING_BB)) +
geom_point(alpha = 0.7) +
geom_smooth(method = "lm", se = FALSE, color = "black") +
scale_color_gradient(low = "blue", high = "red") +
labs(title = "Interaction Effect of Home Runs and Walks on Wins",
x = "Home Runs",
y = "Wins",
color = "Walks (BB)") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Red (high walks) teams should have higher wins for the same HRs. Blue
(low walks) teams may not benefit as much from HRs. The trendline is
steeper for teams with more walks, confirming that walks amplify HR
impact.
Doubles (2B) are a strong indicator of offensive power and often correlate with scoring more runs. If a team doesn’t hit home runs, but hits many doubles, it can still score efficiently.
improved_model <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR + TEAM_BATTING_BB +
TEAM_BATTING_2B + TEAM_BATTING_HR:TEAM_BATTING_BB +
TEAM_PITCHING_SO + TEAM_FIELDING_E,
data = train_df)
summary(improved_model)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR +
## TEAM_BATTING_BB + TEAM_BATTING_2B + TEAM_BATTING_HR:TEAM_BATTING_BB +
## TEAM_PITCHING_SO + TEAM_FIELDING_E, data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.744 -8.679 0.175 8.803 55.228
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.110e+01 4.181e+00 2.654 0.00802 **
## TEAM_BATTING_H 6.090e-02 2.961e-03 20.566 < 2e-16 ***
## TEAM_BATTING_HR -1.805e-01 2.717e-02 -6.642 3.86e-11 ***
## TEAM_BATTING_BB -8.276e-03 4.874e-03 -1.698 0.08962 .
## TEAM_BATTING_2B -4.128e-02 9.170e-03 -4.501 7.09e-06 ***
## TEAM_PITCHING_SO 1.670e-03 5.799e-04 2.880 0.00401 **
## TEAM_FIELDING_E -2.575e-02 2.256e-03 -11.413 < 2e-16 ***
## TEAM_BATTING_HR:TEAM_BATTING_BB 3.214e-04 4.715e-05 6.817 1.19e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.51 on 2268 degrees of freedom
## Multiple R-squared: 0.2669, Adjusted R-squared: 0.2646
## F-statistic: 117.9 on 7 and 2268 DF, p-value: < 2.2e-16
The residual standard error is 13.51 wins (slightly better than 13.57 before). The model explains 26.7% of the variance in wins (up from 26.0% in the previous model), and the adjusted R², which penalizes extra predictors, also improved. The model is statistically significant overall.
Model performance improved slightly (R² increased from 26.0% → 26.7%). Doubles (TEAM_BATTING_2B) unexpectedly have a negative impact on wins.
Possible issue: Doubles may be highly correlated with other batting stats (e.g., hits, HRs). Solution: We will check multicollinearity (VIF) or add an interaction term (e.g., TEAM_BATTING_2B * TEAM_BATTING_H). HRs alone are still negative (-0.1805), but the interaction term remains strong. Conclusion: HRs are only useful when paired with walks.
Adding TEAM_BATTING_2B slightly improves model performance
R² increased from 26.0% → 26.7% (small improvement). Residual standard error decreased from 13.57 → 13.51 (better fit). Unexpected negative coefficient for doubles (-0.0413, p = 7.09e-06)
Suggests that more doubles lead to fewer wins, which is counterintuitive. Possible reasons: Multicollinearity with TEAM_BATTING_H (hits). Bad teams might hit many doubles but still lose. Interaction term (TEAM_BATTING_HR * TEAM_BATTING_BB) remains strong and positive
Confirms that HRs are more valuable when combined with walks. Suggests plate discipline (BBs) is crucial for power hitters. Decision: Should we keep TEAM_BATTING_2B?
If VIF test shows high correlation with TEAM_BATTING_H, we should drop it. If interaction terms (e.g., TEAM_BATTING_2B * TEAM_BATTING_H) make sense, we could try that instead.
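One way to see why TEAM_BATTING_2B can pick up a counterintuitive sign is that doubles are a component of TEAM_BATTING_H, so the two are mechanically correlated. A small synthetic sketch of this (illustrative Poisson counts, not the moneyball data):

```r
# A component count (doubles) is built into the total it belongs to (hits),
# so the two share variance by construction
set.seed(1)
n       <- 500
singles <- rpois(n, 950)
doubles <- rpois(n, 240)
others  <- rpois(n, 160)   # triples + home runs, lumped together
hits    <- singles + doubles + others

cor(doubles, hits)  # positive purely by construction
```

With both variables in the model, the doubles coefficient measures the effect of a double *holding total hits fixed*, i.e., trading some other hit for a double, which is a different (and plausibly negative) quantity than the raw value of a double.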
We select variables based on correlation with TARGET_WINS and ensure they are not highly correlated with each other (VIF < 5).
library(car)
# Manually selected high-impact variables
high_impact_model <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_HR +
TEAM_PITCHING_HR + TEAM_PITCHING_SO + TEAM_FIELDING_E, data = train_df)
# View model summary
summary(high_impact_model)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## TEAM_BATTING_HR + TEAM_PITCHING_HR + TEAM_PITCHING_SO + TEAM_FIELDING_E,
## data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.201 -9.120 0.166 9.119 52.347
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.3004601 3.5021314 1.513 0.13029
## TEAM_BATTING_H 0.0605230 0.0030785 19.660 < 2e-16 ***
## TEAM_BATTING_2B -0.0396466 0.0093245 -4.252 2.21e-05 ***
## TEAM_BATTING_HR -0.0059120 0.0230824 -0.256 0.79788
## TEAM_PITCHING_HR 0.0115696 0.0215103 0.538 0.59072
## TEAM_PITCHING_SO 0.0016054 0.0005943 2.701 0.00696 **
## TEAM_FIELDING_E -0.0235977 0.0018188 -12.974 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.73 on 2269 degrees of freedom
## Multiple R-squared: 0.2425, Adjusted R-squared: 0.2405
## F-statistic: 121 on 6 and 2269 DF, p-value: < 2.2e-16
# Check for multicollinearity
vif(high_impact_model)
## TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_HR TEAM_PITCHING_HR
## 2.391698 2.298929 23.577777 20.987169
## TEAM_PITCHING_SO TEAM_FIELDING_E
## 1.245605 2.071742
TEAM_BATTING_2B (Doubles): Important for offensive production.
TEAM_PITCHING_HR (Home Runs Allowed): Directly impacts opponent scoring (negative impact).
The VIF check, however, flags TEAM_BATTING_HR (23.6) and TEAM_PITCHING_HR (21.0) as highly collinear (VIF > 10): batting and pitching home-run totals are nearly redundant, so one of the two should be dropped before trusting their coefficients.
What changed?
Better feature selection → based on both domain knowledge and correlation analysis, though the redundant home-run pair still needs to be resolved.
# Apply log transformation to selected variables
train_df <- train_df %>%
mutate(
log_BATTING_H = log1p(TEAM_BATTING_H),
log_BATTING_HR = log1p(TEAM_BATTING_HR),
log_PITCHING_SO = log1p(TEAM_PITCHING_SO)
)
# Fit model with transformed variables
log_model <- lm(TARGET_WINS ~ log_BATTING_H + log_BATTING_HR + log_PITCHING_SO + TEAM_FIELDING_E, data = train_df)
# View model summary
summary(log_model)
##
## Call:
## lm(formula = TARGET_WINS ~ log_BATTING_H + log_BATTING_HR + log_PITCHING_SO +
## TEAM_FIELDING_E, data = train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -53.951 -9.135 0.181 9.355 50.090
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.900e+02 2.507e+01 -19.546 <2e-16 ***
## log_BATTING_H 7.921e+01 3.416e+00 23.184 <2e-16 ***
## log_BATTING_HR -6.581e-01 4.895e-01 -1.344 0.179
## log_PITCHING_SO 2.231e-01 4.737e-01 0.471 0.638
## TEAM_FIELDING_E -2.109e-02 1.964e-03 -10.741 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.8 on 2271 degrees of freedom
## Multiple R-squared: 0.2341, Adjusted R-squared: 0.2328
## F-statistic: 173.6 on 4 and 2271 DF, p-value: < 2.2e-16
Fixes right-skewed distributions (e.g., extreme HR and SO values).
Helps meet the linear regression assumption of normality.
Reduces outlier influence.
What changed?
More stable regression coefficients with reduced variability.
Can improve the fit for non-linear relationships, although here the adjusted R² (0.2328) is essentially unchanged from the base model (0.2349).
summary(base_model)$adj.r.squared
## [1] 0.2349286
summary(high_impact_model)$adj.r.squared
## [1] 0.2404727
summary(log_model)$adj.r.squared
## [1] 0.2327997
mse_base <- mean((train_df$TARGET_WINS - predict(base_model, train_df))^2)
mse_high_impact <- mean((train_df$TARGET_WINS - predict(high_impact_model, train_df))^2)
mse_log <- mean((train_df$TARGET_WINS - predict(log_model, train_df))^2)
print(c(mse_base, mse_high_impact, mse_log))
## [1] 189.4204 187.8821 189.9474
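The MSE values above are in squared wins; taking square roots puts the comparison back on the original wins scale, where it lines up with the residual standard errors reported earlier. A short sketch using the printed values:

```r
# Convert the reported MSEs to RMSEs (wins) for easier interpretation
mse  <- c(base = 189.4204, high_impact = 187.8821, log = 189.9474)
rmse <- sqrt(mse)
round(rmse, 2)  # high_impact has the lowest in-sample error
```

All three models miss by roughly 13.7 wins on average, so the differences between them are small in practical terms.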
vif(high_impact_model)
## TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_HR TEAM_PITCHING_HR
## 2.391698 2.298929 23.577777 20.987169
## TEAM_PITCHING_SO TEAM_FIELDING_E
## 1.245605 2.071742
For this model, we use backward selection to arrive at the smallest set of variables with statistical significance. We begin with a model containing all predictors, then repeatedly remove the variable with the highest p-value until only statistically significant variables remain.
The remaining variables TEAM_BATTING_SO, TEAM_PITCHING_SO, TEAM_BATTING_HR, and TEAM_PITCHING_HR include several observations with a value of zero. Since the affected rows are the same across these variables, these appear to be missing observations: it is highly unlikely that a team’s batters recorded no strikeouts, and its pitchers threw none, over the course of a 162-game season. We will therefore treat these zeros as missing and impute using the median, which is more robust to outliers than the mean. We will also drop the single row where the team did not win a single game, as this value is equally suspect.
library(miscTools)
# Convert dubious stats to NAs for pitching
# and drop unnecessary columns
train_df_clean <- train_df_clean |>
mutate(
TEAM_BATTING_SO = if_else(TEAM_BATTING_SO > 0, TEAM_BATTING_SO, NA_integer_),
TEAM_PITCHING_SO = if_else(TEAM_PITCHING_SO > 0, TEAM_PITCHING_SO, NA_integer_),
TEAM_BATTING_HR = if_else(TEAM_BATTING_HR > 0, TEAM_BATTING_HR, NA_integer_),
TEAM_PITCHING_HR = if_else(TEAM_PITCHING_HR > 0, TEAM_PITCHING_HR, NA_integer_)
) |>
filter(TARGET_WINS > 0) |>
drop_columns(c("TEAM_BASERUN_SB_Missing"))
# get medians
train_median_val <- colMedians(train_df_clean, na.rm = TRUE)
# Impute missing values using column medians
for(i in colnames(train_df_clean))
train_df_clean[,i][is.na(train_df_clean[,i])] <- as.integer(train_median_val[i])
sum(is.na(train_df_clean)) # Total missing values
## [1] 0
colSums(is.na(train_df_clean)) # Missing values per column
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
## 0 0 0 0
## TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
## 0 0 0 0
## TEAM_BASERUN_CS TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB
## 0 0 0 0
## TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 0 0 0
We split the original training dataset into a training and testing dataset in order to test the strength of our model without testing the model on data that the model has already seen. The training dataset will hold 75% of our original observations, while the test dataset will hold 25%.
smp_size <- floor(0.75 * nrow(train_df_clean))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(train_df_clean)), size = smp_size)
stp75_train_df <- train_df_clean[train_ind, ]
stp25_test_df <- train_df_clean[-train_ind, ]
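Once a final model is fit on the 75% split, it can be scored on the 25% holdout to estimate out-of-sample error. A generic, self-contained sketch of that pattern (using R’s built-in mtcars for illustration; the same steps apply to our stp_model and stp25_test_df):

```r
# Holdout evaluation pattern: fit on 75%, score the unseen 25%
set.seed(123)
idx   <- sample(seq_len(nrow(mtcars)), size = floor(0.75 * nrow(mtcars)))
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

fit  <- lm(mpg ~ wt + hp, data = train)   # fit only on training rows
pred <- predict(fit, newdata = test)      # predict rows the model never saw

rmse_test <- sqrt(mean((test$mpg - pred)^2))
rmse_test
```

Comparing this holdout RMSE to the training residual standard error is the key check for overfitting.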
The Residuals vs. Fitted and QQ plots show a fairly linear pattern, while the Scale-Location plot suggests homoscedasticity. However, the Residuals vs. Leverage plot reveals the presence of some outliers.
stp_model_full <- lm(TARGET_WINS ~ ., data = stp75_train_df)
summary(stp_model_full)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = stp75_train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.282 -8.645 0.048 8.329 57.187
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.8474117 6.3024098 2.832 0.004683 **
## TEAM_BATTING_H 0.0514104 0.0044714 11.498 < 2e-16 ***
## TEAM_BATTING_2B -0.0223261 0.0106295 -2.100 0.035842 *
## TEAM_BATTING_3B 0.0670262 0.0199886 3.353 0.000816 ***
## TEAM_BATTING_HR 0.0282412 0.0316737 0.892 0.372718
## TEAM_BATTING_BB 0.0066123 0.0067402 0.981 0.326723
## TEAM_BATTING_SO -0.0035742 0.0029068 -1.230 0.219018
## TEAM_BASERUN_SB 0.0278688 0.0051806 5.379 8.52e-08 ***
## TEAM_BASERUN_CS -0.0137241 0.0181722 -0.755 0.450220
## TEAM_PITCHING_H -0.0008462 0.0004836 -1.750 0.080333 .
## TEAM_PITCHING_HR 0.0243807 0.0289192 0.843 0.399312
## TEAM_PITCHING_BB 0.0017888 0.0047790 0.374 0.708223
## TEAM_PITCHING_SO 0.0021175 0.0009865 2.147 0.031969 *
## TEAM_FIELDING_E -0.0223339 0.0028407 -7.862 6.68e-15 ***
## TEAM_FIELDING_DP -0.1065312 0.0149243 -7.138 1.40e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.07 on 1691 degrees of freedom
## Multiple R-squared: 0.3031, Adjusted R-squared: 0.2973
## F-statistic: 52.53 on 14 and 1691 DF, p-value: < 2.2e-16
plot(stp_model_full)
# get points of influence
influence <- influence.measures(stp_model_full)
influential_points <- influence$infmat
cooks_d <- influence$infmat[, "cook.d"]
max_influence_index <- which.max(cooks_d)
The observation with index 2135 is particularly problematic. A closer examination reveals that its TEAM_PITCHING_SO is more than 50% higher than the next highest value (19278 vs 12758). TEAM_PITCHING_H is also unusually high for this team.
influential_data_point <- stp75_train_df[max_influence_index, ]
print(influential_data_point)
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## 2135 41 992 263 20 103
## TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
## 2135 142 952 124.7618 52.80386
## TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
## 2135 20088 108 2876 19278
## TEAM_FIELDING_E TEAM_FIELDING_DP
## 2135 952 146.3879
We will drop this row since it has such a high leverage on the model.
stp75_train_df <- stp75_train_df |>
filter(TEAM_PITCHING_SO < 15000)
# confirm outliers
stp_model_full <- lm(TARGET_WINS ~ ., data = stp75_train_df)
plot(stp_model_full)
Charting the plot shows another point (202) with high influence. However, inspecting this point further doesn’t reveal any glaring issues, so it is better to leave the point in the dataset and possibly address the outlier through a transformation.
# get points of influence
influence <- influence.measures(stp_model_full)
influential_points <- influence$infmat
cooks_d <- influence$infmat[, "cook.d"]
max_influence_index <- which.max(cooks_d)
influential_data_point <- stp75_train_df[max_influence_index, ]
print(influential_data_point)
## TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## 202 108 1188 338 0 103
## TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
## 202 270 945 124.7618 52.80386
## TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
## 202 16038 108 3645 12758
## TEAM_FIELDING_E TEAM_FIELDING_DP
## 202 716 146.3879
We will use Best Subset Selection to get a sense of the optimal number of predictors in our model. This test will use Mallows’ Cp.
library(leaps)
regfit_full = regsubsets(TARGET_WINS ~ ., data = stp75_train_df, nvmax = 11)
regfit_summary = summary(regfit_full)
plot(regfit_summary$cp, xlab="Number of variables", ylab="cp")
points(which.min(regfit_summary$cp),regfit_summary$cp[which.min(regfit_summary$cp)], pch=20,col="red")
Based on the Best Subset Selection method, we estimate that our
model should have 11 predictors.
As an alternative to Best Subset Selection, we used the Cross Validation method to estimate the optimal number of predictors in our model. Cross Validation divides the training dataset into k “folds”, fits the model on k − 1 of them, and tests on the held-out fold, rotating through all k folds. For our test, we used five folds (k = 5).
set.seed(11)
folds=sample(rep(1:5,length=nrow(stp75_train_df)))
cv_errors = matrix(NA,5,10)
for(k in 1:5) {
best_fit = regsubsets(TARGET_WINS ~ ., data=stp75_train_df[folds!=k,], nvmax=10, method="forward")
for(i in 1:10) {
# Extract the selected coefficients for the i-th model
selected_coefs = coef(best_fit, id = i)
# Predict manually by calculating the linear combination of the features
# First, subset the data for the k-th fold
test_data = stp75_train_df[folds == k, ]
# Only include the predictors that were selected
predictors = names(selected_coefs)[-1] # Exclude the intercept term
# Calculate the predictions (including the intercept)
pred = as.matrix(test_data[, predictors]) %*% selected_coefs[predictors] + selected_coefs[1]
cv_errors[k,i]=mean((stp75_train_df$TARGET_WINS[folds==k] - pred)^2)
}
}
rmse_cv = sqrt(apply(cv_errors,2,mean))
plot(rmse_cv, pch=5, type="b")
Based on the Cross Validation method, we estimate that our
model should have 8 predictors.
summary(stp_model_full)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = stp75_train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.620 -8.601 0.052 8.421 57.406
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.0717019 6.2861603 2.875 0.004093 **
## TEAM_BATTING_H 0.0508525 0.0044631 11.394 < 2e-16 ***
## TEAM_BATTING_2B -0.0231877 0.0106049 -2.187 0.028915 *
## TEAM_BATTING_3B 0.0686399 0.0199424 3.442 0.000592 ***
## TEAM_BATTING_HR 0.0271056 0.0315921 0.858 0.391022
## TEAM_BATTING_BB 0.0158109 0.0073265 2.158 0.031064 *
## TEAM_BATTING_SO -0.0085203 0.0032952 -2.586 0.009803 **
## TEAM_BASERUN_SB 0.0294916 0.0051924 5.680 1.59e-08 ***
## TEAM_BASERUN_CS -0.0145075 0.0181259 -0.800 0.423606
## TEAM_PITCHING_H -0.0005380 0.0004921 -1.093 0.274438
## TEAM_PITCHING_HR 0.0272622 0.0288572 0.945 0.344933
## TEAM_PITCHING_BB -0.0062491 0.0054035 -1.156 0.247643
## TEAM_PITCHING_SO 0.0062409 0.0016350 3.817 0.000140 ***
## TEAM_FIELDING_E -0.0230493 0.0028423 -8.109 9.67e-16 ***
## TEAM_FIELDING_DP -0.1057973 0.0148867 -7.107 1.74e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.03 on 1690 degrees of freedom
## Multiple R-squared: 0.3045, Adjusted R-squared: 0.2988
## F-statistic: 52.86 on 14 and 1690 DF, p-value: < 2.2e-16
AIC(stp_model_full)
## [1] 13610.63
TEAM_BATTING_HR has the highest p-value. We remove TEAM_BATTING_HR from our predictors and update the model.
stp_model <- update(stp_model_full, . ~ . - TEAM_BATTING_HR)
summary(stp_model)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## TEAM_BATTING_3B + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB +
## TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB +
## TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP, data = stp75_train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.556 -8.641 0.089 8.405 57.158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.8537601 6.2805359 2.843 0.004527 **
## TEAM_BATTING_H 0.0508576 0.0044627 11.396 < 2e-16 ***
## TEAM_BATTING_2B -0.0230698 0.0106032 -2.176 0.029712 *
## TEAM_BATTING_3B 0.0663806 0.0197662 3.358 0.000802 ***
## TEAM_BATTING_BB 0.0183056 0.0067243 2.722 0.006550 **
## TEAM_BATTING_SO -0.0083971 0.0032919 -2.551 0.010833 *
## TEAM_BASERUN_SB 0.0295425 0.0051917 5.690 1.49e-08 ***
## TEAM_BASERUN_CS -0.0159538 0.0180460 -0.884 0.376786
## TEAM_PITCHING_H -0.0004697 0.0004856 -0.967 0.333568
## TEAM_PITCHING_HR 0.0504241 0.0101961 4.945 8.35e-07 ***
## TEAM_PITCHING_BB -0.0081166 0.0049454 -1.641 0.100929
## TEAM_PITCHING_SO 0.0064775 0.0016114 4.020 6.08e-05 ***
## TEAM_FIELDING_E -0.0233451 0.0028211 -8.275 2.57e-16 ***
## TEAM_FIELDING_DP -0.1052461 0.0148717 -7.077 2.15e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.03 on 1691 degrees of freedom
## Multiple R-squared: 0.3042, Adjusted R-squared: 0.2989
## F-statistic: 56.88 on 13 and 1691 DF, p-value: < 2.2e-16
AIC(stp_model)
## [1] 13609.38
TEAM_BASERUN_CS has the highest p-value. We remove TEAM_BASERUN_CS from our predictors and update the model.
stp_model <- update(stp_model, . ~ . - TEAM_BASERUN_CS)
summary(stp_model)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## TEAM_BATTING_3B + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB +
## TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO +
## TEAM_FIELDING_E + TEAM_FIELDING_DP, data = stp75_train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.461 -8.664 0.030 8.334 57.089
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.7922558 6.1642946 2.724 0.006514 **
## TEAM_BATTING_H 0.0508731 0.0044624 11.400 < 2e-16 ***
## TEAM_BATTING_2B -0.0234865 0.0105920 -2.217 0.026730 *
## TEAM_BATTING_3B 0.0664649 0.0197647 3.363 0.000789 ***
## TEAM_BATTING_BB 0.0186458 0.0067129 2.778 0.005536 **
## TEAM_BATTING_SO -0.0083331 0.0032908 -2.532 0.011424 *
## TEAM_BASERUN_SB 0.0284459 0.0050410 5.643 1.96e-08 ***
## TEAM_PITCHING_H -0.0005012 0.0004843 -1.035 0.300809
## TEAM_PITCHING_HR 0.0519835 0.0100417 5.177 2.53e-07 ***
## TEAM_PITCHING_BB -0.0080575 0.0049446 -1.630 0.103383
## TEAM_PITCHING_SO 0.0064719 0.0016113 4.017 6.17e-05 ***
## TEAM_FIELDING_E -0.0227886 0.0027498 -8.287 2.33e-16 ***
## TEAM_FIELDING_DP -0.1057265 0.0148608 -7.114 1.65e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.03 on 1692 degrees of freedom
## Multiple R-squared: 0.3039, Adjusted R-squared: 0.299
## F-statistic: 61.56 on 12 and 1692 DF, p-value: < 2.2e-16
AIC(stp_model)
## [1] 13608.16
TEAM_PITCHING_H has the highest p-value. We remove TEAM_PITCHING_H from our predictors and update the model.
stp_model <- update(stp_model, . ~ . - TEAM_PITCHING_H)
summary(stp_model)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## TEAM_BATTING_3B + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB +
## TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO +
## TEAM_FIELDING_E + TEAM_FIELDING_DP, data = stp75_train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.197 -8.531 -0.043 8.327 57.840
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.206678 6.151406 2.797 0.005213 **
## TEAM_BATTING_H 0.049754 0.004330 11.492 < 2e-16 ***
## TEAM_BATTING_2B -0.023045 0.010584 -2.177 0.029587 *
## TEAM_BATTING_3B 0.070990 0.019276 3.683 0.000238 ***
## TEAM_BATTING_BB 0.021631 0.006062 3.568 0.000369 ***
## TEAM_BATTING_SO -0.008439 0.003289 -2.566 0.010383 *
## TEAM_BASERUN_SB 0.029503 0.004936 5.977 2.77e-09 ***
## TEAM_PITCHING_HR 0.051418 0.010027 5.128 3.27e-07 ***
## TEAM_PITCHING_BB -0.010623 0.004278 -2.483 0.013122 *
## TEAM_PITCHING_SO 0.006665 0.001601 4.164 3.28e-05 ***
## TEAM_FIELDING_E -0.024038 0.002471 -9.729 < 2e-16 ***
## TEAM_FIELDING_DP -0.105280 0.014855 -7.087 2.00e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.03 on 1693 degrees of freedom
## Multiple R-squared: 0.3035, Adjusted R-squared: 0.2989
## F-statistic: 67.05 on 11 and 1693 DF, p-value: < 2.2e-16
AIC(stp_model)
## [1] 13607.24
Backward selection using p-values arrives at a model with eleven statistically significant variables: eight strong (p < 0.001) and three moderate (p < 0.05). Eleven predictors is also consistent with our Best Subset Selection test.
Performing a variance inflation factor (VIF) test on the model shows two variables (TEAM_BATTING_BB and TEAM_BATTING_SO) with high correlation. We may choose to drop one or both of these variables to fine-tune our model. This will increase our AIC score somewhat but may avoid multicollinearity issues in our model.
library(performance)
vif(stp_model)
## TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_BB
## 3.599364 2.462555 2.748013 5.487126
## TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_HR TEAM_PITCHING_BB
## 5.757712 1.732534 3.815306 4.909374
## TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 4.252946 3.003044 1.345211
#plot(stp_model)
# Remove "TEAM_BATTING_BB"
stp_model2 <- update(stp_model, . ~ . - TEAM_BATTING_BB)
vif(stp_model2)
## TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_SO
## 3.598363 2.459607 2.740642 5.004152
## TEAM_BASERUN_SB TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
## 1.711815 3.815292 2.002579 2.549136
## TEAM_FIELDING_E TEAM_FIELDING_DP
## 1.992665 1.317035
Another approach to model selection combines step-wise selection with contextual selection. In this example, we first drop TEAM_BATTING_2B (doubles), TEAM_BATTING_3B (triples), and TEAM_BATTING_HR (home runs) before conducting our stepwise selection, as these counts are already contained in the TEAM_BATTING_H (total hits) column, signaling dependence between the individual hit-type variables and total hits. It is not entirely clear from the supporting document whether TEAM_PITCHING_HR is likewise included in TEAM_PITCHING_H, but we remove it in the event that a similar relationship exists. Alternatively, we could have created a new column (TEAM_BATTING_1B) by subtracting the sum of doubles, triples, and home runs from total hits, but this alternative solution would have (a) added complexity to our model and (b) added uncertainty due to possible missing values in the home-run column.
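The alternative mentioned above, deriving a singles column rather than dropping the hit-type variables, could be sketched as follows (the values are illustrative, taken from a few evaluation-data rows shown earlier):

```r
# Derive singles from total hits minus extra-base hits
hits    <- c(1445, 1539, 1385)
doubles <- c(203, 309, 158)
triples <- c(68, 29, 42)
hr      <- c(5, 159, 33)

TEAM_BATTING_1B <- hits - (doubles + triples + hr)
TEAM_BATTING_1B
```

This would break the exact linear dependence between the hit-count columns, at the cost of propagating any missing values in the home-run column into the new variable.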
stp75s_train_df <- stp75_train_df |>
drop_columns(c("TEAM_BATTING_2B", "TEAM_BATTING_3B", "TEAM_BATTING_HR", "TEAM_PITCHING_HR"))
stp25s_test_df <- stp25_test_df |>
drop_columns(c("TEAM_BATTING_2B", "TEAM_BATTING_3B", "TEAM_BATTING_HR", "TEAM_PITCHING_HR"))
stp_model3 <- lm(TARGET_WINS ~ ., data = stp75s_train_df)
# Backward step-wise regression
stpb_model3 <- step(stp_model3, direction = "backward")
## Start: AIC=8797.81
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_BATTING_SO +
## TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_BB +
## TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP
##
## Df Sum of Sq RSS AIC
## - TEAM_PITCHING_BB 1 99 293244 8796.4
## - TEAM_PITCHING_H 1 222 293367 8797.1
## <none> 293145 8797.8
## - TEAM_BATTING_SO 1 360 293505 8797.9
## - TEAM_BASERUN_CS 1 545 293690 8799.0
## - TEAM_BATTING_BB 1 1176 294321 8802.6
## - TEAM_PITCHING_SO 1 1893 295038 8806.8
## - TEAM_BASERUN_SB 1 6394 299539 8832.6
## - TEAM_FIELDING_DP 1 7603 300748 8839.5
## - TEAM_FIELDING_E 1 11594 304739 8861.9
## - TEAM_BATTING_H 1 66719 359864 9145.4
##
## Step: AIC=8796.39
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_BATTING_SO +
## TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_SO +
## TEAM_FIELDING_E + TEAM_FIELDING_DP
##
## Df Sum of Sq RSS AIC
## - TEAM_BATTING_SO 1 262 293506 8795.9
## <none> 293244 8796.4
## - TEAM_BASERUN_CS 1 527 293771 8797.4
## - TEAM_PITCHING_H 1 561 293806 8797.6
## - TEAM_BATTING_BB 1 2184 295429 8807.0
## - TEAM_PITCHING_SO 1 2865 296109 8811.0
## - TEAM_BASERUN_SB 1 6336 299581 8830.8
## - TEAM_FIELDING_DP 1 7665 300910 8838.4
## - TEAM_FIELDING_E 1 11621 304866 8860.7
## - TEAM_BATTING_H 1 68755 361999 9153.5
##
## Step: AIC=8795.91
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB + TEAM_BASERUN_SB +
## TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_SO + TEAM_FIELDING_E +
## TEAM_FIELDING_DP
##
## Df Sum of Sq RSS AIC
## <none> 293506 8795.9
## - TEAM_BASERUN_CS 1 427 293933 8796.4
## - TEAM_PITCHING_H 1 451 293957 8796.5
## - TEAM_BATTING_BB 1 2198 295705 8806.6
## - TEAM_PITCHING_SO 1 2864 296370 8810.5
## - TEAM_BASERUN_SB 1 6380 299886 8830.6
## - TEAM_FIELDING_DP 1 7902 301408 8839.2
## - TEAM_FIELDING_E 1 11474 304980 8859.3
## - TEAM_BATTING_H 1 74236 367742 9178.4
Performing an automated step-wise backward selection gives us 8 viable predictors. Examining the VIF shows that these eight predictors are not strongly collinear with one another.
vif(stpb_model3)
## TEAM_BATTING_H TEAM_BATTING_BB TEAM_BASERUN_SB TEAM_BASERUN_CS
## 1.385157 2.082994 1.566271 1.145926
## TEAM_PITCHING_H TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 2.447477 1.334044 3.430504 1.295700
Looking at the p-values for the remaining 8 predictors shows two variables that are not statistically significant (TEAM_BASERUN_CS and TEAM_PITCHING_H). We could remove the one with the highest p-value, or both of these variables.
summary(stpb_model3)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB +
## TEAM_BASERUN_SB + TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_SO +
## TEAM_FIELDING_E + TEAM_FIELDING_DP, data = stp75s_train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.417 -8.933 0.159 8.448 51.313
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.3124047 4.6045415 1.588 0.112453
## TEAM_BATTING_H 0.0561558 0.0027113 20.711 < 2e-16 ***
## TEAM_BATTING_BB 0.0134383 0.0037704 3.564 0.000375 ***
## TEAM_BASERUN_SB 0.0287687 0.0047382 6.072 1.56e-09 ***
## TEAM_BASERUN_CS -0.0277376 0.0176666 -1.570 0.116588
## TEAM_PITCHING_H -0.0006408 0.0003969 -1.614 0.106633
## TEAM_PITCHING_SO 0.0036813 0.0009049 4.068 4.96e-05 ***
## TEAM_FIELDING_E -0.0217060 0.0026658 -8.142 7.43e-16 ***
## TEAM_FIELDING_DP -0.0994484 0.0147173 -6.757 1.93e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.16 on 1696 degrees of freedom
## Multiple R-squared: 0.2889, Adjusted R-squared: 0.2856
## F-statistic: 86.14 on 8 and 1696 DF, p-value: < 2.2e-16
AIC(stpb_model3)
## [1] 13636.49
Here, we have removed TEAM_BASERUN_CS. Our AIC goes up and Adjusted R-squared goes down slightly, but overall the models are quite similar. TEAM_PITCHING_H now has a p-value of 0.0855, which is only somewhat higher than our 0.05 threshold.
stpb_model4 <- update(stpb_model3, . ~ . - TEAM_BASERUN_CS)
summary(stpb_model4)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB +
## TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_SO + TEAM_FIELDING_E +
## TEAM_FIELDING_DP, data = stp75s_train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.253 -9.012 0.092 8.496 51.129
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.2189307 4.4091592 1.184 0.2367
## TEAM_BATTING_H 0.0562394 0.0027120 20.737 < 2e-16 ***
## TEAM_BATTING_BB 0.0144777 0.0037134 3.899 0.0001 ***
## TEAM_BASERUN_SB 0.0264450 0.0045031 5.873 5.15e-09 ***
## TEAM_PITCHING_H -0.0006818 0.0003962 -1.721 0.0855 .
## TEAM_PITCHING_SO 0.0038350 0.0009000 4.261 2.15e-05 ***
## TEAM_FIELDING_E -0.0209251 0.0026201 -7.986 2.54e-15 ***
## TEAM_FIELDING_DP -0.0993356 0.0147235 -6.747 2.07e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.16 on 1697 degrees of freedom
## Multiple R-squared: 0.2879, Adjusted R-squared: 0.2849
## F-statistic: 98 on 7 and 1697 DF, p-value: < 2.2e-16
AIC(stpb_model4)
## [1] 13636.97
Removing TEAM_PITCHING_H increases our AIC and decreases the Adjusted R-squared slightly, but overall the model remains quite similar to the auto-generated step-wise selected model.
stpb_model5 <- update(stpb_model4, . ~ . - TEAM_PITCHING_H)
summary(stpb_model5)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_BB +
## TEAM_BASERUN_SB + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP,
## data = stp75s_train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.617 -9.035 -0.013 8.421 52.082
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.4142804 4.2229387 1.756 0.079317 .
## TEAM_BATTING_H 0.0546040 0.0025414 21.485 < 2e-16 ***
## TEAM_BATTING_BB 0.0143595 0.0037149 3.865 0.000115 ***
## TEAM_BASERUN_SB 0.0284735 0.0043485 6.548 7.71e-11 ***
## TEAM_PITCHING_SO 0.0032001 0.0008214 3.896 0.000102 ***
## TEAM_FIELDING_E -0.0235091 0.0021482 -10.943 < 2e-16 ***
## TEAM_FIELDING_DP -0.0995380 0.0147315 -6.757 1.93e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.17 on 1698 degrees of freedom
## Multiple R-squared: 0.2866, Adjusted R-squared: 0.2841
## F-statistic: 113.7 on 6 and 1698 DF, p-value: < 2.2e-16
AIC(stpb_model5)
## [1] 13637.94
The function compare_performance allows comparison of various metrics across different models, including R2 and R2 (adj.). It also weights AIC and BIC values to help rank the performance of our models. Based on the result of this test, it appears that our original model performed best.
compare_performance(stp_model, stp_model2, stpb_model3, stpb_model4, stpb_model5, rank = TRUE)
Some of our predictors (TEAM_PITCHING_SO, TEAM_PITCHING_BB, & TEAM_PITCHING_H) showed the presence of outliers. We will transform these variables with a log transformation.
stplg75_train_df <- stp75_train_df %>%
mutate(
TEAM_PITCHING_SO = log1p(TEAM_PITCHING_SO),
TEAM_PITCHING_BB = log1p(TEAM_PITCHING_BB),
TEAM_PITCHING_H = log1p(TEAM_PITCHING_H),
)
stplg25_test_df <- stp25_test_df %>%
mutate(
TEAM_PITCHING_SO = log1p(TEAM_PITCHING_SO),
TEAM_PITCHING_BB = log1p(TEAM_PITCHING_BB),
TEAM_PITCHING_H = log1p(TEAM_PITCHING_H),
)
stplg75_train_df %>%
gather(variable, value, -TARGET_WINS) %>%
ggplot(., aes(value, TARGET_WINS)) +
geom_point(fill = "#628B3A", color="#628B3A") +
geom_smooth(method = "lm", se = FALSE, color = "black") +
facet_wrap(~variable, scales ="free", ncol = 4) +
labs(x = element_blank(), y = "Wins")
## `geom_smooth()` using formula = 'y ~ x'
Our scatter plots show some improvement in our outliers.
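For context on the transformation above: log1p(x) computes log(1 + x), which is defined at zero (some pitching columns contain 0) and compresses large values toward the bulk of the distribution. A quick illustration with values spanning the observed range of TEAM_PITCHING_H:

```r
x <- c(0, 1155, 1515, 22768)  # zero, near-minimum, median-ish, and maximum values
log1p(x)

# The ~15x gap between 1515 and 22768 shrinks to roughly 2.7 log units,
# which is why the extreme points lose leverage after the transformation.
diff(log1p(c(1515, 22768)))
```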
We will run Best Subset Selection & Cross-Validation tests again to get a sense of the ideal number of predictors.
library(leaps)
regfit_full = regsubsets(TARGET_WINS ~ ., data = stplg75_train_df, nvmax = 11)
regfit_summary = summary(regfit_full)
plot(regfit_summary$cp, xlab="Number of variables", ylab="cp")
points(which.min(regfit_summary$cp), regfit_summary$cp[which.min(regfit_summary$cp)], pch=20,col="red")
## CV
set.seed(11)
folds=sample(rep(1:5,length=nrow(stplg75_train_df)))
cv_errors = matrix(NA,5,10)
for(k in 1:5) {
best_fit = regsubsets(TARGET_WINS ~ ., data=stplg75_train_df[folds!=k,], nvmax=10, method="forward")
for(i in 1:10) {
# Extract the selected coefficients for the i-th model
selected_coefs = coef(best_fit, id = i)
# Predict manually by calculating the linear combination of the features
# First, subset the data for the k-th fold
test_data = stplg75_train_df[folds == k, ]
# Only include the predictors that were selected
predictors = names(selected_coefs)[-1] # Exclude the intercept term
# Calculate the predictions (including the intercept)
pred = as.matrix(test_data[, predictors]) %*% selected_coefs[predictors] + selected_coefs[1]
cv_errors[k,i]=mean((stplg75_train_df$TARGET_WINS[folds==k] - pred)^2)
}
}
rmse_cv = sqrt(apply(cv_errors,2,mean))
plot(rmse_cv, pch=5, type="b")
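The manual prediction step inside the loop above — rebuilding the model matrix for the held-out fold and multiplying by the selected coefficients, intercept included — can be factored into a small reusable helper (predict_from_coefs is our own name for it). Verified here against predict() on a plain lm fit:

```r
# Given a named coefficient vector (including "(Intercept)"), predict on new
# data by rebuilding the model matrix and taking the linear combination.
predict_from_coefs <- function(coefs, newdata, formula) {
  mat <- model.matrix(formula, newdata)           # includes an intercept column
  drop(mat[, names(coefs), drop = FALSE] %*% coefs)
}

# Sanity check: matches predict() for an ordinary lm fit.
fit <- lm(mpg ~ wt + hp, data = mtcars)
max(abs(predict_from_coefs(coef(fit), mtcars, mpg ~ wt + hp) - predict(fit)))
```

With coefs = coef(best_fit, id = i) and the fold's rows as newdata, this would replace the manual block inside the cross-validation loop.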
We get a similar number of predictors after testing with Best Subset Selection and Cross-Validation, at 11 and 9 predictors respectively.
We will use backward step-wise selection to build our model.
stplg_model_full <- lm(TARGET_WINS ~ ., data = stplg75_train_df)
# Backward step-wise regression
stplg_model <- step(stplg_model_full, direction = "backward")
## Start: AIC=8713.15
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B +
## TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB +
## TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB +
## TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP
##
## Df Sum of Sq RSS AIC
## - TEAM_BASERUN_CS 1 138.8 277778 8712.0
## - TEAM_PITCHING_HR 1 180.3 277819 8712.3
## <none> 277639 8713.1
## - TEAM_BATTING_2B 1 512.4 278151 8714.3
## - TEAM_BATTING_HR 1 991.6 278630 8717.2
## - TEAM_PITCHING_H 1 3373.1 281012 8731.7
## - TEAM_BATTING_SO 1 3893.5 281532 8734.9
## - TEAM_BATTING_3B 1 4412.3 282051 8738.0
## - TEAM_PITCHING_SO 1 4828.8 282467 8740.5
## - TEAM_BATTING_H 1 6290.3 283929 8749.3
## - TEAM_FIELDING_DP 1 6605.1 284244 8751.2
## - TEAM_PITCHING_BB 1 8105.2 285744 8760.2
## - TEAM_BATTING_BB 1 9217.2 286856 8766.8
## - TEAM_BASERUN_SB 1 9612.0 287251 8769.2
## - TEAM_FIELDING_E 1 18523.1 296162 8821.3
##
## Step: AIC=8712
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B +
## TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB +
## TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO +
## TEAM_FIELDING_E + TEAM_FIELDING_DP
##
## Df Sum of Sq RSS AIC
## - TEAM_PITCHING_HR 1 182.5 277960 8711.1
## <none> 277778 8712.0
## - TEAM_BATTING_2B 1 541.9 278319 8713.3
## - TEAM_BATTING_HR 1 1051.0 278829 8716.4
## - TEAM_PITCHING_H 1 3279.0 281056 8730.0
## - TEAM_BATTING_SO 1 3998.4 281776 8734.4
## - TEAM_BATTING_3B 1 4424.3 282202 8736.9
## - TEAM_PITCHING_SO 1 4984.2 282762 8740.3
## - TEAM_BATTING_H 1 6412.4 284190 8748.9
## - TEAM_FIELDING_DP 1 6685.8 284463 8750.6
## - TEAM_PITCHING_BB 1 8055.8 285833 8758.7
## - TEAM_BATTING_BB 1 9270.7 287048 8766.0
## - TEAM_BASERUN_SB 1 9612.3 287390 8768.0
## - TEAM_FIELDING_E 1 18609.7 296387 8820.6
##
## Step: AIC=8711.12
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B +
## TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB +
## TEAM_PITCHING_H + TEAM_PITCHING_BB + TEAM_PITCHING_SO + TEAM_FIELDING_E +
## TEAM_FIELDING_DP
##
## Df Sum of Sq RSS AIC
## <none> 277960 8711.1
## - TEAM_BATTING_2B 1 511.7 278472 8712.3
## - TEAM_PITCHING_H 1 3096.7 281057 8728.0
## - TEAM_BATTING_HR 1 3529.3 281489 8730.6
## - TEAM_BATTING_SO 1 3882.3 281842 8732.8
## - TEAM_BATTING_3B 1 4282.7 282243 8735.2
## - TEAM_PITCHING_SO 1 4813.9 282774 8738.4
## - TEAM_BATTING_H 1 6434.2 284394 8748.1
## - TEAM_FIELDING_DP 1 6816.1 284776 8750.4
## - TEAM_PITCHING_BB 1 8385.4 286345 8759.8
## - TEAM_BATTING_BB 1 9484.0 287444 8766.3
## - TEAM_BASERUN_SB 1 9522.1 287482 8766.6
## - TEAM_FIELDING_E 1 18681.2 296641 8820.0
Looking at the summary for this model shows that nearly all predictors have high statistical significance; only TEAM_BATTING_2B is marginal (p ≈ 0.078).
summary(stplg_model)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO +
## TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_BB + TEAM_PITCHING_SO +
## TEAM_FIELDING_E + TEAM_FIELDING_DP, data = stplg75_train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.628 -8.317 -0.193 7.985 61.171
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -63.641214 17.214581 -3.697 0.000225 ***
## TEAM_BATTING_H 0.036536 0.005838 6.258 4.92e-10 ***
## TEAM_BATTING_2B -0.018417 0.010435 -1.765 0.077752 .
## TEAM_BATTING_3B 0.102137 0.020004 5.106 3.66e-07 ***
## TEAM_BATTING_HR 0.049758 0.010735 4.635 3.84e-06 ***
## TEAM_BATTING_BB 0.068369 0.008998 7.598 4.95e-14 ***
## TEAM_BATTING_SO -0.025447 0.005235 -4.861 1.27e-06 ***
## TEAM_BASERUN_SB 0.038504 0.005057 7.613 4.41e-14 ***
## TEAM_PITCHING_H 18.750182 4.318645 4.342 1.50e-05 ***
## TEAM_PITCHING_BB -27.300228 3.821159 -7.144 1.34e-12 ***
## TEAM_PITCHING_SO 18.096505 3.342998 5.413 7.08e-08 ***
## TEAM_FIELDING_E -0.039108 0.003667 -10.664 < 2e-16 ***
## TEAM_FIELDING_DP -0.094437 0.014661 -6.441 1.54e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.82 on 1692 degrees of freedom
## Multiple R-squared: 0.3266, Adjusted R-squared: 0.3218
## F-statistic: 68.38 on 12 and 1692 DF, p-value: < 2.2e-16
The Residuals vs. Fitted and Q-Q plots show a fairly linear pattern, while the Scale-Location plot suggests homoscedasticity. The Residuals vs. Leverage plot reveals that our outliers have less leverage than in the pre-log-transformed model.
plot(stplg_model)
stplg75_train_df <- stplg75_train_df |>
mutate(
n = row_number()
)
A VIF test shows several variables with high VIF values — TEAM_PITCHING_H, TEAM_BATTING_SO, TEAM_BATTING_BB, and TEAM_PITCHING_SO all exceed 10 — indicating multicollinearity. We may choose to leave these out.
vif(stplg_model)
## TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## 6.764826 2.474428 3.059386 4.409563
## TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_PITCHING_H
## 12.497987 15.072722 1.879750 15.787183
## TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
## 8.998267 11.981906 6.839368 1.354523
Running our performance test shows that our new model with log transformations is our best performing model based on Adjusted R-squared as well as weighted AIC and BIC scores.
compare_performance(stp_model, stp_model2, stpb_model3, stpb_model4, stpb_model5, stplg_model, rank = TRUE)
The Least Absolute Shrinkage and Selection Operator (LASSO) selection method uses a shrinkage approach to determine the optimal predictors, attempting to find a balance between simplicity and accuracy. It applies a penalty to the standard linear regression model that encourages the coefficients of weakly influential features to shrink to exactly zero, which helps prevent overfitting.
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.1-8
x = model.matrix(TARGET_WINS ~ ., data = stplg75_train_df)
y = stplg75_train_df$TARGET_WINS
fit_lasso = glmnet(x, y, alpha=1)
plot(fit_lasso, xvar="lambda", label=TRUE)
cv_lasso = cv.glmnet(x,y,alpha=1)
#plot(cv_lasso)
coef(cv_lasso)
## 17 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -1.682957e+01
## (Intercept) .
## TEAM_BATTING_H 4.259232e-02
## TEAM_BATTING_2B -5.748709e-03
## TEAM_BATTING_3B 7.375109e-02
## TEAM_BATTING_HR 2.964076e-02
## TEAM_BATTING_BB 1.965249e-02
## TEAM_BATTING_SO -3.090798e-03
## TEAM_BASERUN_SB 2.787417e-02
## TEAM_BASERUN_CS -5.427011e-03
## TEAM_PITCHING_H 5.621139e+00
## TEAM_PITCHING_HR 4.170862e-03
## TEAM_PITCHING_BB -4.397995e+00
## TEAM_PITCHING_SO 3.584607e+00
## TEAM_FIELDING_E -2.631764e-02
## TEAM_FIELDING_DP -8.957448e-02
## n -1.857381e-04
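Note that coef() on a cv.glmnet object defaults to s = "lambda.1se" — the most regularized model within one standard error of the minimum cross-validated error; s = "lambda.min" typically retains more nonzero coefficients. A small helper for pulling the surviving predictor names out of a coefficient matrix (nonzero_predictors is our own illustrative function):

```r
# Names of predictors with nonzero coefficients, given a one-column
# coefficient matrix such as as.matrix(coef(cv_lasso, s = "lambda.min")).
nonzero_predictors <- function(coef_matrix) {
  rn <- rownames(coef_matrix)
  rn[coef_matrix[, 1] != 0 & rn != "(Intercept)"]
}

# Usage with our fit:
# nonzero_predictors(as.matrix(coef(cv_lasso)))                   # at lambda.1se
# nonzero_predictors(as.matrix(coef(cv_lasso, s = "lambda.min"))) # usually longer
```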
Performing LASSO on our training subset – 75% of the full training set with log transformation applied – gives us 7 predictors:
- TEAM_BATTING_H
- TEAM_BATTING_3B
- TEAM_BATTING_BB
- TEAM_BASERUN_SB
- TEAM_PITCHING_HR
- TEAM_FIELDING_E
- TEAM_FIELDING_DP
lasso_model <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B + TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_PITCHING_HR + TEAM_FIELDING_E + TEAM_FIELDING_DP, data = stplg75_train_df)
plot(lasso_model)
summary(lasso_model)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_3B +
## TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_PITCHING_HR + TEAM_FIELDING_E +
## TEAM_FIELDING_DP, data = stplg75_train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -54.434 -8.658 -0.028 8.217 56.772
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.440265 4.030071 4.824 1.53e-06 ***
## TEAM_BATTING_H 0.044214 0.002856 15.481 < 2e-16 ***
## TEAM_BATTING_3B 0.078345 0.018757 4.177 3.11e-05 ***
## TEAM_BATTING_BB 0.009191 0.003802 2.418 0.0157 *
## TEAM_BASERUN_SB 0.028170 0.004684 6.014 2.21e-09 ***
## TEAM_PITCHING_HR 0.044837 0.007918 5.663 1.75e-08 ***
## TEAM_FIELDING_E -0.022652 0.002178 -10.401 < 2e-16 ***
## TEAM_FIELDING_DP -0.104915 0.014896 -7.043 2.72e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.1 on 1697 degrees of freedom
## Multiple R-squared: 0.2943, Adjusted R-squared: 0.2914
## F-statistic: 101.1 on 7 and 1697 DF, p-value: < 2.2e-16
vif(lasso_model)
## TEAM_BATTING_H TEAM_BATTING_3B TEAM_BATTING_BB TEAM_BASERUN_SB
## 1.549492 2.574363 2.135160 1.542953
## TEAM_PITCHING_HR TEAM_FIELDING_E TEAM_FIELDING_DP
## 2.353644 2.308619 1.338265
Our VIF test shows no strong collinearity between predictors. The Residuals vs. Fitted and Q-Q plots show a fairly linear pattern, while the Scale-Location plot suggests homoscedasticity. The Residuals vs. Leverage plot reveals that our outliers have less leverage than in the pre-log-transformed model.
Here we are comparing all models using our training subset (75% of observations).
base_model2 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR + TEAM_PITCHING_SO + TEAM_FIELDING_E, data = stp75_train_df)
high_impact_model2 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_HR +
TEAM_PITCHING_HR + TEAM_PITCHING_SO + TEAM_FIELDING_E, data = stp75_train_df)
improved_model2 <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_HR + TEAM_BATTING_BB +
TEAM_BATTING_2B + TEAM_BATTING_HR:TEAM_BATTING_BB +
TEAM_PITCHING_SO + TEAM_FIELDING_E,
data = stp75_train_df)
# Apply log transformation to selected variables
stp75lf2_train_df <- stp75_train_df %>%
mutate(
log_BATTING_H = log1p(TEAM_BATTING_H),
log_BATTING_HR = log1p(TEAM_BATTING_HR),
log_PITCHING_SO = log1p(TEAM_PITCHING_SO)
)
stp25lf2_train_df <- stp25_test_df %>%
mutate(
log_BATTING_H = log1p(TEAM_BATTING_H),
log_BATTING_HR = log1p(TEAM_BATTING_HR),
log_PITCHING_SO = log1p(TEAM_PITCHING_SO)
)
log_model2 <- lm(TARGET_WINS ~ log_BATTING_H + log_BATTING_HR + log_PITCHING_SO + TEAM_FIELDING_E, data = stp75lf2_train_df)
compare_performance(base_model2, high_impact_model2, log_model2, improved_model2, stp_model, stp_model2, stpb_model3, stpb_model4, stpb_model5, stplg_model, lasso_model)
If interpretability is most important → Use base Stats Model.
If statistical optimization is preferred → Use High-Impact Model.
If non-linearity is a concern → Use Log-Transformed Model.
If statistical significance is most important → Log-Transformed Step-wise Model
In this section, we will use the test dataframe to assess the accuracy of our models' predictions.
# Load necessary libraries
library(Metrics) # For MAE and RMSE
##
## Attaching package: 'Metrics'
## The following objects are masked from 'package:performance':
##
## mae, mse, rmse
## The following objects are masked from 'package:caret':
##
## precision, recall
## The following object is masked from 'package:forecast':
##
## accuracy
# Stats Model predictions
predictedWins = predict(base_model2, stp25_test_df)
stp25_test_df["PREDICTED_WINS"] = predictedWins
# High-Impact Model
predictedWins = predict(high_impact_model2, stp25_test_df)
stp25_test_df["PREDICTED_WINS_HIM"] = predictedWins
# Log-Transformed Model
predictedWins = predict(log_model2, stp25lf2_train_df)
stp25lf2_train_df["PREDICTED_WINS_LOG"] = predictedWins
# Log-Transformed Step-wise Model
predictedWins = predict(stplg_model, stplg25_test_df)
stplg25_test_df["PREDICTED_WINS_STP"] = predictedWins
# Create a df to store our results
evaluation_metrics <- data.frame(
model_name = character(0),
MAE = numeric(0),
RMSE = numeric(0),
RSquared = numeric(0)
)
eval_predictions <- function(model, predict_col, target_col, model_name) {
pred = lm(predict_col ~ target_col)
plot(target_col, predict_col, xlab="Actual Wins", ylab="Predicted Wins")
abline(pred)
plot(model$residuals)
model_metrics = data.frame(
model_name = model_name,
MAE = mae(target_col, predict_col),
RMSE = rmse(target_col, predict_col),
RSquared = (cor(target_col, predict_col)^2 )
)
evaluation_metrics <<- rbind(evaluation_metrics, model_metrics)
}
# Evaluate Stats Model predictions
eval_predictions(base_model2, stp25_test_df$PREDICTED_WINS, stp25_test_df$TARGET_WINS, "Stats")
# Improved Model (generate its own predictions rather than reusing the log model's)
predictedWins = predict(improved_model2, stp25_test_df)
stp25_test_df["PREDICTED_WINS_IMP"] = predictedWins
eval_predictions(improved_model2, stp25_test_df$PREDICTED_WINS_IMP, stp25_test_df$TARGET_WINS, "Improved")
# Evaluate High-Impact Model
eval_predictions(high_impact_model2, stp25_test_df$PREDICTED_WINS_HIM, stp25_test_df$TARGET_WINS, "High-Impact")
# Evaluate Log-Transformed Model
eval_predictions(log_model2, stp25lf2_train_df$PREDICTED_WINS_LOG, stp25lf2_train_df$TARGET_WINS, "Log-Transformed")
# Evaluate Log-Transformed Step-wise Model
eval_predictions(stplg_model, stplg25_test_df$PREDICTED_WINS_STP, stplg25_test_df$TARGET_WINS, "LT Step-wise")
To evaluate the accuracy of each model's predictions, we will compare:
- the Mean Absolute Error (MAE)
- the Root Mean Squared Error (RMSE)
- the R-squared (R²)
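These three metrics are simple enough to compute by hand; base-R equivalents of the Metrics functions used above:

```r
mae_manual  <- function(actual, predicted) mean(abs(actual - predicted))
rmse_manual <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
r2_manual   <- function(actual, predicted) cor(actual, predicted)^2

actual    <- c(70, 80, 90)
predicted <- c(72, 78, 95)
mae_manual(actual, predicted)   # mean(2, 2, 5) = 3
rmse_manual(actual, predicted)  # sqrt((4 + 4 + 25) / 3) = sqrt(11), about 3.32
```

RMSE penalizes large errors more heavily than MAE, which is why the two can rank models differently when one model has a few big misses.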
print(evaluation_metrics)
## model_name MAE RMSE RSquared
## 1 Stats 10.95450 13.96901 0.2278177
## 2 Improved 10.95525 13.89549 0.2368148
## 3 High-Impact 10.88520 13.90523 0.2345308
## 4 Log-Transformed 10.95525 13.89549 0.2368148
## 5 LT Step-wise 10.11544 13.14682 0.3183422
A quick examination of these shows the Log-Transformed Step-wise Model has the lowest MAE and RMSE as well as the highest R-squared, suggesting that this model is somewhat more accurate than the other four evaluated. For confirmation we will perform a 5-fold cross-validation below:
library(caret)
train_control <- trainControl(method = "cv", number = 5) # 5-fold cross-validation
model_cv <- train(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO +
TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_BB + TEAM_PITCHING_SO +
TEAM_FIELDING_E + TEAM_FIELDING_DP, data = stplg75_train_df, method = "lm", trControl = train_control)
print(model_cv)
## Linear Regression
##
## 1705 samples
## 12 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1365, 1364, 1363, 1363, 1365
## Resampling results:
##
## RMSE Rsquared MAE
## 12.99227 0.3034024 10.07271
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
summary(stplg_model)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B +
## TEAM_BATTING_3B + TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO +
## TEAM_BASERUN_SB + TEAM_PITCHING_H + TEAM_PITCHING_BB + TEAM_PITCHING_SO +
## TEAM_FIELDING_E + TEAM_FIELDING_DP, data = stplg75_train_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -52.628 -8.317 -0.193 7.985 61.171
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -63.641214 17.214581 -3.697 0.000225 ***
## TEAM_BATTING_H 0.036536 0.005838 6.258 4.92e-10 ***
## TEAM_BATTING_2B -0.018417 0.010435 -1.765 0.077752 .
## TEAM_BATTING_3B 0.102137 0.020004 5.106 3.66e-07 ***
## TEAM_BATTING_HR 0.049758 0.010735 4.635 3.84e-06 ***
## TEAM_BATTING_BB 0.068369 0.008998 7.598 4.95e-14 ***
## TEAM_BATTING_SO -0.025447 0.005235 -4.861 1.27e-06 ***
## TEAM_BASERUN_SB 0.038504 0.005057 7.613 4.41e-14 ***
## TEAM_PITCHING_H 18.750182 4.318645 4.342 1.50e-05 ***
## TEAM_PITCHING_BB -27.300228 3.821159 -7.144 1.34e-12 ***
## TEAM_PITCHING_SO 18.096505 3.342998 5.413 7.08e-08 ***
## TEAM_FIELDING_E -0.039108 0.003667 -10.664 < 2e-16 ***
## TEAM_FIELDING_DP -0.094437 0.014661 -6.441 1.54e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.82 on 1692 degrees of freedom
## Multiple R-squared: 0.3266, Adjusted R-squared: 0.3218
## F-statistic: 68.38 on 12 and 1692 DF, p-value: < 2.2e-16
MAE and RMSE are even lower when using cross-validation. Although our R-squared value decreased from 0.3183 to 0.3034, it is still higher than the second-best R-squared of 0.2368, achieved by the Log-Transformed model.
Our previous tests suggest that the Log-Transformed Step-wise Model is the most accurate at predicting a team's wins over a season. However, with 12 predictors, this model is fairly complex. Given the audience for our report and the manager's level of interest, we should select a simpler model that is still statistically significant and fairly accurate.
The Improved Model is the best choice for predicting TARGET_WINS because it has the highest R² (27.89%), the lowest residual standard error (13.25), and avoids severe multicollinearity issues.
Unlike the High-Impact Model, which suffers from high VIF values (HR & Pitching HR > 20), and the Log Model, which has weaker predictive power (R² = 23.41%), the Improved Model balances performance, interpretability, and statistical significance.
Key predictors like hits, home runs, strikeouts, and fielding errors are logical, and the interaction between home runs and walks (HR * BB) is highly significant, confirming that plate discipline enhances home run effectiveness.
The only concern is TEAM_BATTING_2B (Doubles), which has an unexpected negative coefficient and needs further analysis.
Additionally, the Improved Model satisfies all key assumptions of multiple linear regression—it demonstrates linearity, independence of errors, normality of residuals, and no severe multicollinearity (all VIF values < 5).
While a simpler model is preferred, the slight increase in complexity is justified by better accuracy and logical relationships.
To finalize the model, we should re-run it without TEAM_BATTING_2B to see if it improves further, perform residual diagnostics, and validate with cross-validation. Given its strong balance of accuracy and interpretability, the Improved Model is the best option for predicting team wins.
Making Predictions Using the Evaluation Dataset:
missing_values <- eval_df %>%
summarise(across(everything(), ~ sum(is.na(.)))) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Missing_Count")
print(missing_values)
## # A tibble: 16 × 2
## Variable Missing_Count
## <chr> <int>
## 1 INDEX 0
## 2 TEAM_BATTING_H 0
## 3 TEAM_BATTING_2B 0
## 4 TEAM_BATTING_3B 0
## 5 TEAM_BATTING_HR 0
## 6 TEAM_BATTING_BB 0
## 7 TEAM_BATTING_SO 18
## 8 TEAM_BASERUN_SB 13
## 9 TEAM_BASERUN_CS 87
## 10 TEAM_BATTING_HBP 240
## 11 TEAM_PITCHING_H 0
## 12 TEAM_PITCHING_HR 0
## 13 TEAM_PITCHING_BB 0
## 14 TEAM_PITCHING_SO 18
## 15 TEAM_FIELDING_E 0
## 16 TEAM_FIELDING_DP 31
eval_df <- eval_df[, !names(eval_df) %in% "INDEX"]
eval_df <- eval_df[, !names(eval_df) %in% "TEAM_BATTING_HBP"]
eval_df
Instead of removing missing values, fill them with the mean or median:
eval_df$TEAM_BATTING_SO[is.na(eval_df$TEAM_BATTING_SO)] <- mean(eval_df$TEAM_BATTING_SO, na.rm = TRUE)
eval_df$TEAM_BASERUN_SB[is.na(eval_df$TEAM_BASERUN_SB)] <- mean(eval_df$TEAM_BASERUN_SB, na.rm = TRUE)
eval_df$TEAM_BASERUN_CS[is.na(eval_df$TEAM_BASERUN_CS)] <- mean(eval_df$TEAM_BASERUN_CS, na.rm = TRUE)
eval_df$TEAM_FIELDING_DP[is.na(eval_df$TEAM_FIELDING_DP)] <- mean(eval_df$TEAM_FIELDING_DP, na.rm = TRUE)
eval_df$TEAM_PITCHING_SO[is.na(eval_df$TEAM_PITCHING_SO)] <- mean(eval_df$TEAM_PITCHING_SO, na.rm = TRUE)
sum(is.na(eval_df)) # Total missing values
## [1] 0
colSums(is.na(eval_df)) # Missing values per column
## TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B TEAM_BATTING_HR
## 0 0 0 0
## TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS
## 0 0 0 0
## TEAM_PITCHING_H TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
## 0 0 0 0
## TEAM_FIELDING_E TEAM_FIELDING_DP
## 0 0
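The repeated per-column assignments above can be condensed into one helper; a base-R sketch (impute_mean is our own name for it), which should behave the same as the five statements above for the listed columns:

```r
# Replace NA values in each listed column with that column's mean.
impute_mean <- function(df, cols) {
  for (col in cols) {
    df[[col]][is.na(df[[col]])] <- mean(df[[col]], na.rm = TRUE)
  }
  df
}

toy <- data.frame(x = c(1, NA, 3), y = c(NA, 2, 4))
impute_mean(toy, c("x", "y"))$x  # 1, 2, 3 (NA replaced by mean(1, 3) = 2)

# Usage with our evaluation set:
# eval_df <- impute_mean(eval_df, c("TEAM_BATTING_SO", "TEAM_BASERUN_SB",
#                                   "TEAM_BASERUN_CS", "TEAM_FIELDING_DP",
#                                   "TEAM_PITCHING_SO"))
```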
skim(eval_df)
| Name | eval_df |
| Number of rows | 259 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| numeric | 14 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| TEAM_BATTING_H | 0 | 1 | 1469.39 | 150.66 | 819 | 1387.0 | 1455.00 | 1548.0 | 2170 | ▁▂▇▁▁ |
| TEAM_BATTING_2B | 0 | 1 | 241.32 | 49.52 | 44 | 210.0 | 239.00 | 278.5 | 376 | ▁▂▇▇▂ |
| TEAM_BATTING_3B | 0 | 1 | 55.91 | 27.14 | 14 | 35.0 | 52.00 | 72.0 | 155 | ▇▇▃▁▁ |
| TEAM_BATTING_HR | 0 | 1 | 95.63 | 56.33 | 0 | 44.5 | 101.00 | 135.5 | 242 | ▆▅▇▃▁ |
| TEAM_BATTING_BB | 0 | 1 | 498.96 | 120.59 | 15 | 436.5 | 509.00 | 565.5 | 792 | ▁▁▅▇▁ |
| TEAM_BATTING_SO | 0 | 1 | 709.34 | 234.48 | 0 | 565.0 | 709.34 | 904.5 | 1268 | ▁▃▇▆▂ |
| TEAM_BASERUN_SB | 0 | 1 | 123.70 | 91.00 | 0 | 60.5 | 96.00 | 149.0 | 580 | ▇▅▁▁▁ |
| TEAM_BASERUN_CS | 0 | 1 | 52.32 | 18.81 | 0 | 44.0 | 52.32 | 56.0 | 154 | ▁▇▂▁▁ |
| TEAM_PITCHING_H | 0 | 1 | 1813.46 | 1662.91 | 1155 | 1426.5 | 1515.00 | 1681.0 | 22768 | ▇▁▁▁▁ |
| TEAM_PITCHING_HR | 0 | 1 | 102.15 | 57.65 | 0 | 52.0 | 104.00 | 142.5 | 336 | ▇▇▆▁▁ |
| TEAM_PITCHING_BB | 0 | 1 | 552.42 | 172.95 | 136 | 471.0 | 526.00 | 606.5 | 2008 | ▆▇▁▁▁ |
| TEAM_PITCHING_SO | 0 | 1 | 799.67 | 611.78 | 0 | 622.5 | 782.00 | 927.5 | 9963 | ▇▁▁▁▁ |
| TEAM_FIELDING_E | 0 | 1 | 249.75 | 230.90 | 73 | 131.0 | 163.00 | 252.0 | 1568 | ▇▁▁▁▁ |
| TEAM_FIELDING_DP | 0 | 1 | 146.06 | 24.28 | 69 | 134.5 | 146.06 | 160.5 | 204 | ▁▂▇▆▁ |
# Generate predictions using the trained Improved Model
eval_df$PREDICTED_WINS <- predict(improved_model, newdata = eval_df)
# View a sample of predictions
head(eval_df$PREDICTED_WINS)
## [1] 69.15118 71.73961 80.06122 82.52091 75.19106 72.10608
The predicted values represent the expected number of wins for teams based on their batting, pitching, and fielding statistics using the Improved Model.
For example:
A team with a predicted 69.15 wins is expected to win around 69 games in a full season. A team with a predicted 82.52 wins is expected to perform better, likely finishing with around 82-83 wins. The variation in predictions suggests that some teams are stronger than others based on key performance metrics (hits, home runs, walks, strikeouts, fielding errors, etc.).
Now that we have predicted TARGET_WINS, we want to compare these predictions against actual win totals. Note that the evaluation dataset contains no TARGET_WINS column, so below we align predictions with training-set TARGET_WINS values by row index — a rough sanity check rather than a true hold-out evaluation.
# Check structure of the dataset
str(eval_df)
## 'data.frame': 259 obs. of 15 variables:
## $ TEAM_BATTING_H : int 1209 1221 1395 1539 1445 1431 1430 1385 1259 1397 ...
## $ TEAM_BATTING_2B : int 170 151 183 309 203 236 219 158 177 212 ...
## $ TEAM_BATTING_3B : int 33 29 29 29 68 53 55 42 78 42 ...
## $ TEAM_BATTING_HR : int 83 88 93 159 5 10 37 33 23 58 ...
## $ TEAM_BATTING_BB : int 447 516 509 486 95 215 568 356 466 452 ...
## $ TEAM_BATTING_SO : num 1080 929 816 914 416 377 527 609 689 584 ...
## $ TEAM_BASERUN_SB : num 62 54 59 148 124 ...
## $ TEAM_BASERUN_CS : num 50 39 47 57 52.3 ...
## $ TEAM_PITCHING_H : int 1209 1221 1395 1539 3902 2793 1544 1626 1342 1489 ...
## $ TEAM_PITCHING_HR: int 83 88 93 159 14 20 40 39 25 62 ...
## $ TEAM_PITCHING_BB: int 447 516 509 486 257 420 613 418 497 482 ...
## $ TEAM_PITCHING_SO: num 1080 929 816 914 1123 ...
## $ TEAM_FIELDING_E : int 140 135 156 124 616 572 490 328 226 184 ...
## $ TEAM_FIELDING_DP: num 156 164 153 154 130 ...
## $ PREDICTED_WINS : num 69.2 71.7 80.1 82.5 75.2 ...
# Check if ACTUAL_WINS and PREDICTED_WINS are numeric
is.numeric(eval_df$ACTUAL_WINS)
## [1] FALSE
is.numeric(eval_df$PREDICTED_WINS)
## [1] TRUE
The FALSE result does not mean ACTUAL_WINS holds non-numeric values: eval_df contains no ACTUAL_WINS column at all (the evaluation data ships without actual win totals), so `eval_df$ACTUAL_WINS` returns NULL, and `is.numeric(NULL)` is FALSE.
# The original INDEX column was dropped during preprocessing, so add a
# sequential row index to both datasets. Note that these positional
# indices do not identify the same teams across the two files.
train_df$INDEX <- seq_len(nrow(train_df))
eval_df$INDEX <- seq_len(nrow(eval_df))
# Confirm both datasets now carry the INDEX column
head(train_df$INDEX)
## [1] 1 2 3 4 5 6
head(eval_df$INDEX)
## [1] 1 2 3 4 5 6
# Merge train_df and eval_df using INDEX (if applicable)
merged_df <- merge(train_df[, c("INDEX", "TARGET_WINS")],
eval_df[, c("INDEX", "PREDICTED_WINS")],
by = "INDEX", all = FALSE)
# Now check if both columns have equal length
nrow(merged_df)
## [1] 259
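With `all = FALSE`, `merge()` performs an inner join: only rows whose key appears in both data frames survive, which is why the result has 259 rows (the smaller dataset's size). A self-contained toy illustration of this behavior (the values are made up):

```r
# Two small data frames sharing only INDEX values 3 and 4
a <- data.frame(INDEX = 1:4, TARGET_WINS    = c(80, 75, 90, 85))
b <- data.frame(INDEX = 3:6, PREDICTED_WINS = c(88, 82, 79, 91))

# all = FALSE keeps only keys present in BOTH data frames (inner join)
m <- merge(a, b, by = "INDEX", all = FALSE)
m
# Only the rows with INDEX 3 and 4 remain, so nrow(m) is 2
```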
Ensure data consistency before calculating accuracy metrics: both columns must be numeric, otherwise the correlation and error calculations will fail.
# Load necessary libraries
library(Metrics) # For MAE and RMSE
library(ggplot2) # For visualizations
# Compute MAE, RMSE, and R²
mae <- mae(merged_df$TARGET_WINS, merged_df$PREDICTED_WINS)
rmse <- rmse(merged_df$TARGET_WINS, merged_df$PREDICTED_WINS)
r_squared <- cor(merged_df$TARGET_WINS, merged_df$PREDICTED_WINS)^2
# Print results
cat("Model Accuracy Metrics:\n")
## Model Accuracy Metrics:
cat("Mean Absolute Error (MAE):", round(mae, 2), "\n")
## Mean Absolute Error (MAE): 13.87
cat("Root Mean Squared Error (RMSE):", round(rmse, 2), "\n")
## Root Mean Squared Error (RMSE): 17.38
cat("R-Squared (R²):", round(r_squared, 4), "\n")
## R-Squared (R²): 0.001
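The `Metrics` functions are thin wrappers around simple formulas; for reference, the same three quantities can be computed in base R. The toy vectors below are illustrative, not values from the data:

```r
# Base-R versions of MAE, RMSE, and R² on toy vectors
actual    <- c(70, 85, 90, 65)
predicted <- c(72, 80, 95, 60)

mae_manual  <- mean(abs(actual - predicted))       # mean absolute error
rmse_manual <- sqrt(mean((actual - predicted)^2))  # root mean squared error
r2_manual   <- cor(actual, predicted)^2            # squared correlation
```

RMSE is always at least as large as MAE, and the gap between them grows when a few predictions miss by a lot, which is why reporting both is informative.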
Taken at face value, these results suggest the Improved Model predicts TARGET_WINS poorly: the Mean Absolute Error (MAE) of 13.87 means predictions deviate by about 14 wins per team on average, the Root Mean Squared Error (RMSE) of 17.38 shows that some errors are considerably larger, and the R-Squared (R²) of 0.001 implies the model explains almost none of the variation in team wins.
However, these numbers must be interpreted with caution. Because the evaluation file has no TARGET_WINS column, the merge above paired each evaluation row with the training row that happens to occupy the same sequential position, not with the same team. The near-zero R² is therefore expected: the paired "actual" values are essentially unrelated to the evaluation teams, so this comparison is not a valid accuracy check.
A proper out-of-sample assessment requires either true win totals for the evaluation teams or a held-out split of the training data. Beyond the evaluation procedure, performance could still be improved by adding more relevant predictors (such as ERA or OBP), removing unimportant or noisy variables, and exploring alternative model specifications.
With our models tested and evaluated, we can now apply our model to our final evaluation dataframe.
# Generate predictions from the final model and attach them to the
# evaluation dataframe
predictedWins <- predict(stplg_model, newdata = eval_df)
eval_df$PREDICTED_WINS <- predictedWins